# Taxi demand prediction in New York City

In [1]:
#Importing Libraries
# pip3 install graphviz
#pip3 install dask
#pip3 install toolz
#pip3 install cloudpickle
# https://www.youtube.com/watch?v=ieW3G7ZzRZ0
# https://github.com/dask/dask-tutorial
# please do go through this python notebook: https://github.com/dask/dask-tutorial/blob/master/07_dataframe.ipynb
import dask.dataframe as dd  # similar to pandas

import pandas as pd  # pandas to create small dataframes

# pip3 install folium
# if this doesn't work, refer to install_folium.JPG in the drive
import folium  # open street map

# unix time: https://www.unixtimestamp.com/
import datetime #Convert to unix time

import time #Convert to unix time

# if numpy is not installed already: pip3 install numpy
import numpy as np  # do arithmetic operations on arrays

# matplotlib: used to plot graphs
import matplotlib
# matplotlib.use('nbagg') : the nbagg backend makes plots more interactive, e.g. zoom in and zoom out
matplotlib.use('nbagg')
import matplotlib.pylab as plt
import seaborn as sns  # plots
from matplotlib import rcParams  # size of plots

# this lib is used when we calculate the straight-line distance between two (lat,lon) pairs in miles
import gpxpy.geo #Get the haversine distance

from sklearn.cluster import MiniBatchKMeans, KMeans  # clustering
import math
import pickle
import os

# download MinGW-w64: https://mingw-w64.org/doku.php/download/mingw-builds
# install it on your system and set mingw_path below to the installed path
mingw_path = 'C:\\Program Files\\mingw-w64\\x86_64-5.3.0-posix-seh-rt_v4-rev0\\mingw64\\bin'
os.environ['PATH'] = mingw_path + ';' + os.environ['PATH']

# to install xgboost: pip3 install xgboost
# if that doesn't work, check install_xgboost.JPG
import xgboost as xgb

# to install sklearn: pip install -U scikit-learn
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings("ignore")

Data Information

Get the data from: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml (2016 data). The data used in the attached datasets were collected and provided to the NYC Taxi and Limousine Commission (TLC).

Information on taxis:

Yellow Taxi: Yellow Medallion Taxicabs

These are the famous NYC yellow taxis that provide transportation exclusively through street-hails. The number of taxicabs is limited by a finite number of medallions issued by the TLC. You access this mode of transportation by standing in the street and hailing an available taxi with your hand. The pickups are not pre-arranged.

For Hire Vehicles (FHVs)

FHV transportation is accessed by a pre-arrangement with a dispatcher or limo company. These FHVs are not permitted to pick up passengers via street hails, as those rides are not considered pre-arranged.

Green Taxi: Street Hail Livery (SHL)

The SHL program will allow livery vehicle owners to license and outfit their vehicles with green borough taxi branding, meters, credit card machines, and ultimately the right to accept street hails in addition to pre-arranged rides.

Credits: Quora

Footnote:
In this notebook we consider only the yellow taxis, for the time periods Jan - Mar 2015 and Jan - Mar 2016.

Data Collection

We have collected all yellow taxi trip data from Jan 2015 to Dec 2016 (only the Jan - Mar 2015 and Jan - Mar 2016 data will be used).

file name                  size     number of records   number of features
yellow_tripdata_2016-01    1.59GB   10906858            19
yellow_tripdata_2016-02    1.66GB   11382049            19
yellow_tripdata_2016-03    1.78GB   12210952            19
yellow_tripdata_2016-04    1.74GB   11934338            19
yellow_tripdata_2016-05    1.73GB   11836853            19
yellow_tripdata_2016-06    1.62GB   11135470            19
yellow_tripdata_2016-07    884MB    10294080            17
yellow_tripdata_2016-08    854MB    9942263             17
yellow_tripdata_2016-09    870MB    10116018            17
yellow_tripdata_2016-10    933MB    10854626            17
yellow_tripdata_2016-11    868MB    10102128            17
yellow_tripdata_2016-12    897MB    10449408            17
yellow_tripdata_2015-01    1.84GB   12748986            19
yellow_tripdata_2015-02    1.81GB   12450521            19
yellow_tripdata_2015-03    1.94GB   13351609            19
yellow_tripdata_2015-04    1.90GB   13071789            19
yellow_tripdata_2015-05    1.91GB   13158262            19
yellow_tripdata_2015-06    1.79GB   12324935            19
yellow_tripdata_2015-07    1.68GB   11562783            19
yellow_tripdata_2015-08    1.62GB   11130304            19
yellow_tripdata_2015-09    1.63GB   11225063            19
yellow_tripdata_2015-10    1.79GB   12315488            19
yellow_tripdata_2015-11    1.65GB   11312676            19
yellow_tripdata_2015-12    1.67GB   11460573            19
In [3]:
#Looking at the features
# dask dataframe  : # https://github.com/dask/dask-tutorial/blob/master/07_dataframe.ipynb
month = dd.read_csv('yellow_tripdata_2015-01.csv')
print(month.columns)
Index(['VendorID', 'tpep_pickup_datetime', 'tpep_dropoff_datetime',
       'passenger_count', 'trip_distance', 'pickup_longitude',
       'pickup_latitude', 'RateCodeID', 'store_and_fwd_flag',
       'dropoff_longitude', 'dropoff_latitude', 'payment_type', 'fare_amount',
       'extra', 'mta_tax', 'tip_amount', 'tolls_amount',
       'improvement_surcharge', 'total_amount'],
      dtype='object')
In [4]:
# However unlike Pandas, operations on dask.dataframes don't trigger immediate computation, 
# instead they add key-value pairs to an underlying Dask graph. Recall that in the diagram below, 
# circles are operations and rectangles are results.

# to see the visualization you need to install graphviz
# pip3 install graphviz; if this doesn't work, please check install_graphviz.jpg in the drive
month.visualize()
Out[4]:
In [5]:
month.fare_amount.sum().visualize()
Out[5]:

Features in the dataset:

Field Name Description
VendorID A code indicating the TPEP provider that provided the record.
  1. Creative Mobile Technologies
  2. VeriFone Inc.
tpep_pickup_datetime The date and time when the meter was engaged.
tpep_dropoff_datetime The date and time when the meter was disengaged.
Passenger_count The number of passengers in the vehicle. This is a driver-entered value.
Trip_distance The elapsed trip distance in miles reported by the taximeter.
Pickup_longitude Longitude where the meter was engaged.
Pickup_latitude Latitude where the meter was engaged.
RateCodeID The final rate code in effect at the end of the trip.
  1. Standard rate
  2. JFK
  3. Newark
  4. Nassau or Westchester
  5. Negotiated fare
  6. Group ride
Store_and_fwd_flag This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka “store and forward,” because the vehicle did not have a connection to the server. Y= store and forward trip N= not a store and forward trip
Dropoff_longitude Longitude where the meter was disengaged.
Dropoff_latitude Latitude where the meter was disengaged.
Payment_type A numeric code signifying how the passenger paid for the trip.
  1. Credit card
  2. Cash
  3. No charge
  4. Dispute
  5. Unknown
  6. Voided trip
Fare_amount The time-and-distance fare calculated by the meter.
Extra Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges.
MTA_tax $0.50 MTA tax that is automatically triggered based on the metered rate in use.
Improvement_surcharge $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015.
Tip_amount Tip amount. This field is automatically populated for credit card tips. Cash tips are not included.
Tolls_amount Total amount of all tolls paid in the trip.
Total_amount The total amount charged to passengers. Does not include cash tips.

ML Problem Formulation

Time-series forecasting and Regression


- To find the number of pickups, given location coordinates (latitude and longitude) and time, in the query region and surrounding regions.

To solve the above we would be using data collected in Jan - Mar 2015 to predict the pickups in Jan - Mar 2016.

Performance metrics

  1. Mean Absolute Percentage Error (MAPE).
  2. Mean Squared Error (MSE). (Both metrics are sketched below.)
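A minimal sketch of these two metrics, assuming numpy arrays y_true and y_pred of actual and predicted pickup counts (the helper names below are illustrative, not part of the notebook):

import numpy as np

def mape(y_true, y_pred):
    # mean absolute percentage error; assumes y_true contains no zeros
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean(np.abs(y_true - y_pred) / y_true)

def mse(y_true, y_pred):
    # mean squared error
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return np.mean((y_true - y_pred) ** 2)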

Data Cleaning

In this section we will perform univariate analysis and remove outlier/illegitimate values which may have been caused by some error.

In [6]:
# the table below shows a few data points along with all our features
month.head(5)
Out[6]:
VendorID tpep_pickup_datetime tpep_dropoff_datetime passenger_count trip_distance pickup_longitude pickup_latitude RateCodeID store_and_fwd_flag dropoff_longitude dropoff_latitude payment_type fare_amount extra mta_tax tip_amount tolls_amount improvement_surcharge total_amount
0 2 2015-01-15 19:05:39 2015-01-15 19:23:42 1 1.59 -73.993896 40.750111 1 N -73.974785 40.750618 1 12.0 1.0 0.5 3.25 0.0 0.3 17.05
1 1 2015-01-10 20:33:38 2015-01-10 20:53:28 1 3.30 -74.001648 40.724243 1 N -73.994415 40.759109 1 14.5 0.5 0.5 2.00 0.0 0.3 17.80
2 1 2015-01-10 20:33:38 2015-01-10 20:43:41 1 1.80 -73.963341 40.802788 1 N -73.951820 40.824413 2 9.5 0.5 0.5 0.00 0.0 0.3 10.80
3 1 2015-01-10 20:33:39 2015-01-10 20:35:31 1 0.50 -74.009087 40.713818 1 N -74.004326 40.719986 2 3.5 0.5 0.5 0.00 0.0 0.3 4.80
4 1 2015-01-10 20:33:39 2015-01-10 20:52:58 1 3.00 -73.971176 40.762428 1 N -74.004181 40.742653 2 15.0 0.5 0.5 0.00 0.0 0.3 16.30

1. Pickup Latitude and Pickup Longitude

It is inferred from the source https://www.flickr.com/places/info/2459115 that New York is bounded by the location coordinates (lat, long) (40.5774, -74.15) & (40.9176, -73.7004); hence any coordinates not within this bounding box are not considered, as we are only concerned with pickups which originate within New York.

In [7]:
# Plotting pickup coordinates which are outside the bounding box of New York
# we will collect all the points outside the bounding box of New York City into outlier_locations
outlier_locations = month[((month.pickup_longitude <= -74.15) | (month.pickup_latitude <= 40.5774)| \
                   (month.pickup_longitude >= -73.7004) | (month.pickup_latitude >= 40.9176))]

# creating a map with a base location
# read more about folium here: http://folium.readthedocs.io/en/latest/quickstart.html

# note: you don't need to remember any of this; you don't need in-depth knowledge of these maps and plots

map_osm = folium.Map(location=[40.734695, -73.990372], tiles='Stamen Toner')

# we will plot only a sample of the outliers on the map; plotting all the outliers would take more time
sample_locations = outlier_locations.get_partition(0)
#for i,j in sample_locations.iterrows():
#    if int(j['pickup_latitude']) != 0:
#        folium.Marker(list((j['pickup_latitude'],j['pickup_longitude']))).add_to(map_osm)
#map_osm

Observation:- As you can see above, there are some points just outside the boundary, but there are a few that are in South America, Mexico, or Canada.

2. Dropoff Latitude & Dropoff Longitude

It is inferred from the source https://www.flickr.com/places/info/2459115 that New York is bounded by the location coordinates (lat, long) (40.5774, -74.15) & (40.9176, -73.7004); hence any coordinates not within this bounding box are not considered, as we are only concerned with dropoffs which are within New York.

In [8]:
# Plotting dropoff coordinates which are outside the bounding box of New York
# we will collect all the points outside the bounding box of New York City into outlier_locations
outlier_locations = month[((month.dropoff_longitude <= -74.15) | (month.dropoff_latitude <= 40.5774)| \
                   (month.dropoff_longitude >= -73.7004) | (month.dropoff_latitude >= 40.9176))]

# creating a map with a base location
# read more about folium here: http://folium.readthedocs.io/en/latest/quickstart.html

# note: you don't need to remember any of this; you don't need in-depth knowledge of these maps and plots

map_osm = folium.Map(location=[40.734695, -73.990372], tiles='Stamen Toner')

# we will plot only a sample of the outliers on the map; plotting all the outliers would take more time
sample_locations = outlier_locations.get_partition(0)

#for i,j in sample_locations.iterrows():
#    if int(j['pickup_latitude']) != 0:
#        folium.Marker(list((j['dropoff_latitude'],j['dropoff_longitude']))).add_to(map_osm)
#map_osm

Observation:- The observations here are similar to those obtained while analysing pickup latitude and longitude

3. Trip Durations:

According to NYC Taxi & Limousine Commission regulations, the maximum allowed trip duration in a 24-hour interval is 12 hours.

In [9]:
# The timestamps are converted to unix time so as to get the duration (trip time) & speed; pickup times in unix are also used later while binning

# in our data the time is in the format "YYYY-MM-DD HH:MM:SS"; we convert this string to a python time structure and then into a unix timestamp
# https://stackoverflow.com/a/27914405
def convert_to_unix(s):
    return time.mktime(datetime.datetime.strptime(s, "%Y-%m-%d %H:%M:%S").timetuple())
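
# example (illustrative): convert_to_unix("2015-01-01 00:00:00") returns the corresponding POSIX timestamp;
# note that time.mktime interprets the parsed time in the machine's local timezone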



# we return a data frame which contains the columns
# 1.'passenger_count' : self explanatory
# 2.'trip_distance' : self explanatory
# 3.'pickup_longitude' : self explanatory
# 4.'pickup_latitude' : self explanatory
# 5.'dropoff_longitude' : self explanatory
# 6.'dropoff_latitude' : self explanatory
# 7.'total_amount' : total fare that was paid
# 8.'trip_times' : duration of each trip in minutes
# 9.'pickup_times' : pickup time converted into unix time
# 10.'Speed' : average speed of each trip in miles/hr
def return_with_trip_times(month):
    duration = month[['tpep_pickup_datetime','tpep_dropoff_datetime']].compute()
    #pickups and dropoffs to unix time
    duration_pickup = [convert_to_unix(x) for x in duration['tpep_pickup_datetime'].values]
    duration_drop = [convert_to_unix(x) for x in duration['tpep_dropoff_datetime'].values]
    #calculate duration of trips
    durations = (np.array(duration_drop) - np.array(duration_pickup))/float(60)

    #append durations of trips and speed in miles/hr to a new dataframe
    new_frame = month[['passenger_count','trip_distance','pickup_longitude','pickup_latitude','dropoff_longitude','dropoff_latitude','total_amount']].compute()
    
    new_frame['trip_times'] = durations
    new_frame['pickup_times'] = duration_pickup
    new_frame['Speed'] = 60*(new_frame['trip_distance']/new_frame['trip_times'])
    
    return new_frame

# print(frame_with_durations.head())
#  passenger_count	trip_distance	pickup_longitude	pickup_latitude	dropoff_longitude	dropoff_latitude	total_amount	trip_times	pickup_times	Speed
#   1                  1.59	      -73.993896        	40.750111    	-73.974785      	40.750618           	17.05   	 18.050000	1.421329e+09	5.285319
#   1               	3.30    	-74.001648      	40.724243   	-73.994415      	40.759109           	17.80   	19.833333	1.420902e+09	9.983193
#   1               	1.80     	-73.963341      	40.802788     	-73.951820      	40.824413           	10.80   	10.050000	1.420902e+09	10.746269
#   1               	0.50    	-74.009087      	40.713818    	-74.004326       	40.719986           	4.80    	1.866667	1.420902e+09	16.071429
#   1               	3.00    	-73.971176      	40.762428    	-74.004181      	40.742653           	16.30   	19.316667	1.420902e+09	9.318378
frame_with_durations = return_with_trip_times(month)
In [10]:
frame_with_durations.head()
Out[10]:
passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude total_amount trip_times pickup_times Speed
0 1 1.59 -73.993896 40.750111 -73.974785 40.750618 17.05 18.050000 1.421329e+09 5.285319
1 1 3.30 -74.001648 40.724243 -73.994415 40.759109 17.80 19.833333 1.420902e+09 9.983193
2 1 1.80 -73.963341 40.802788 -73.951820 40.824413 10.80 10.050000 1.420902e+09 10.746269
3 1 0.50 -74.009087 40.713818 -74.004326 40.719986 4.80 1.866667 1.420902e+09 16.071429
4 1 3.00 -73.971176 40.762428 -74.004181 40.742653 16.30 19.316667 1.420902e+09 9.318378
In [11]:
# the skewed box plot shows us the presence of outliers 
sns.boxplot(y="trip_times", data =frame_with_durations)
plt.show()
In [12]:
# calculating the 0-100th percentile values to find the correct percentile cutoff for removal of outliers
from tqdm import tqdm
for i in range(0,100,10):
    var =frame_with_durations["trip_times"].values
    var = np.sort(var,axis = None)
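    # (roughly equivalent to np.percentile(var, i), up to the indexing/interpolation convention)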
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print ("100 percentile value is ",var[-1])
0 percentile value is -1211.0166666666667
10 percentile value is 3.8333333333333335
20 percentile value is 5.383333333333334
30 percentile value is 6.816666666666666
40 percentile value is 8.3
50 percentile value is 9.95
60 percentile value is 11.866666666666667
70 percentile value is 14.283333333333333
80 percentile value is 17.633333333333333
90 percentile value is 23.45
100 percentile value is  548555.6333333333
In [13]:
# looking more closely at the percentiles between the 90th and 100th
for i in range(90,100):
    var =frame_with_durations["trip_times"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print ("100 percentile value is ",var[-1])
90 percentile value is 23.45
91 percentile value is 24.35
92 percentile value is 25.383333333333333
93 percentile value is 26.55
94 percentile value is 27.933333333333334
95 percentile value is 29.583333333333332
96 percentile value is 31.683333333333334
97 percentile value is 34.46666666666667
98 percentile value is 38.71666666666667
99 percentile value is 46.75
100 percentile value is  548555.6333333333
In [14]:
#removing data based on our analysis and TLC regulations
frame_with_durations_modified=frame_with_durations[(frame_with_durations.trip_times>1) & (frame_with_durations.trip_times<720)]
In [15]:
#box-plot after removal of outliers
sns.boxplot(y="trip_times", data =frame_with_durations_modified)
plt.show()
In [16]:
#pdf of trip-times after removing the outliers
sns.FacetGrid(frame_with_durations_modified,size=6) \
      .map(sns.kdeplot,"trip_times") \
      .add_legend();
plt.show();
In [17]:
# converting the values to log-values to check for log-normality
import math
frame_with_durations_modified['log_times']=[math.log(i) for i in frame_with_durations_modified['trip_times'].values]
In [18]:
#pdf of log-values
sns.FacetGrid(frame_with_durations_modified,size=6) \
      .map(sns.kdeplot,"log_times") \
      .add_legend();
plt.show();
In [19]:
#Q-Q plot for checking if trip-times is log-normal
import scipy
scipy.stats.probplot(frame_with_durations_modified['log_times'].values, plot=plt)
plt.show()

4. Speed

In [20]:
# check for any outliers in the data after trip duration outliers removed
# box-plot for speeds with outliers
frame_with_durations_modified['Speed'] = 60*(frame_with_durations_modified['trip_distance']/frame_with_durations_modified['trip_times'])
sns.boxplot(y="Speed", data =frame_with_durations_modified)
plt.show()
In [21]:
# calculating speed values at each percentile: 0,10,20,30,40,50,60,70,80,90,100
for i in range(0,100,10):
    var =frame_with_durations_modified["Speed"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print("100 percentile value is ",var[-1])
0 percentile value is 0.0
10 percentile value is 6.409495548961425
20 percentile value is 7.80952380952381
30 percentile value is 8.929133858267717
40 percentile value is 9.98019801980198
50 percentile value is 11.06865671641791
60 percentile value is 12.286689419795222
70 percentile value is 13.796407185628745
80 percentile value is 15.963224893917962
90 percentile value is 20.186915887850468
100 percentile value is  192857142.85714284
In [22]:
# calculating speed values at each percentile: 90,91,92,93,94,95,96,97,98,99,100
for i in range(90,100):
    var =frame_with_durations_modified["Speed"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print("100 percentile value is ",var[-1])
90 percentile value is 20.186915887850468
91 percentile value is 20.91645569620253
92 percentile value is 21.752988047808763
93 percentile value is 22.721893491124263
94 percentile value is 23.844155844155843
95 percentile value is 25.182552504038775
96 percentile value is 26.80851063829787
97 percentile value is 28.84304932735426
98 percentile value is 31.591128254580514
99 percentile value is 35.7513566847558
100 percentile value is  192857142.85714284
In [23]:
# calculating speed values at each percentile: 99.0,99.1,99.2,99.3,99.4,99.5,99.6,99.7,99.8,99.9,100
for i in np.arange(0.0, 1.0, 0.1):
    var =frame_with_durations_modified["Speed"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(99+i,var[int(len(var)*(float(99+i)/100))]))
print("100 percentile value is ",var[-1])
99.0 percentile value is 35.7513566847558
99.1 percentile value is 36.31084727468969
99.2 percentile value is 36.91470054446461
99.3 percentile value is 37.588235294117645
99.4 percentile value is 38.33035714285714
99.5 percentile value is 39.17580340264651
99.6 percentile value is 40.15384615384615
99.7 percentile value is 41.338301043219076
99.8 percentile value is 42.86631016042781
99.9 percentile value is 45.3107822410148
100 percentile value is  192857142.85714284
In [24]:
#removing further outliers based on the 99.9th percentile value
frame_with_durations_modified=frame_with_durations_modified[(frame_with_durations_modified.Speed>0) & (frame_with_durations_modified.Speed<45.31)]
In [25]:
# avg. speed of cabs in New York
sum(frame_with_durations_modified["Speed"]) / float(len(frame_with_durations_modified["Speed"]))
Out[25]:
12.452320837813998

The avg speed in New York is 12.45 miles/hr (12.45/6 ≈ 2.07 miles per 10 minutes), so a cab driver can travel roughly 2 miles per 10 min on average.

5. Trip Distance

In [26]:
# up to now we have removed the outliers based on trip durations and cab speeds
# let's check whether there are any outliers in trip distances
# box-plot showing outliers in trip-distance values
sns.boxplot(y="trip_distance", data =frame_with_durations_modified)
plt.show()
In [27]:
# calculating trip distance values at each percentile: 0,10,20,30,40,50,60,70,80,90,100
for i in range(0,100,10):
    var =frame_with_durations_modified["trip_distance"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print("100 percentile value is ",var[-1])
0 percentile value is 0.01
10 percentile value is 0.67
20 percentile value is 0.9
30 percentile value is 1.1
40 percentile value is 1.39
50 percentile value is 1.7
60 percentile value is 2.08
70 percentile value is 2.61
80 percentile value is 3.6
90 percentile value is 5.98
100 percentile value is  258.9
In [28]:
# calculating trip distance values at each percentile: 90,91,92,93,94,95,96,97,98,99,100
for i in range(90,100):
    var =frame_with_durations_modified["trip_distance"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print("100 percentile value is ",var[-1])
90 percentile value is 5.98
91 percentile value is 6.47
92 percentile value is 7.09
93 percentile value is 7.87
94 percentile value is 8.74
95 percentile value is 9.6
96 percentile value is 10.6
97 percentile value is 12.1
98 percentile value is 16.06
99 percentile value is 18.18
100 percentile value is  258.9
In [29]:
# calculating trip distance values at each percentile: 99.0,99.1,99.2,99.3,99.4,99.5,99.6,99.7,99.8,99.9,100
for i in np.arange(0.0, 1.0, 0.1):
    var =frame_with_durations_modified["trip_distance"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(99+i,var[int(len(var)*(float(99+i)/100))]))
print("100 percentile value is ",var[-1])
99.0 percentile value is 18.18
99.1 percentile value is 18.37
99.2 percentile value is 18.6
99.3 percentile value is 18.84
99.4 percentile value is 19.14
99.5 percentile value is 19.5
99.6 percentile value is 19.97
99.7 percentile value is 20.51
99.8 percentile value is 21.23
99.9 percentile value is 22.58
100 percentile value is  258.9
In [30]:
#removing further outliers based on the 99.9th percentile value
frame_with_durations_modified=frame_with_durations[(frame_with_durations.trip_distance>0) & (frame_with_durations.trip_distance<23)]
In [31]:
#box-plot after removal of outliers
sns.boxplot(y="trip_distance", data = frame_with_durations_modified)
plt.show()

6. Total Fare

In [32]:
# up to now we have removed the outliers based on trip durations, cab speeds, and trip distances
# let's check whether there are any outliers based on the total_amount
# box-plot showing outliers in fare
sns.boxplot(y="total_amount", data =frame_with_durations_modified)
plt.show()
In [33]:
# calculating total fare amount values at each percentile: 0,10,20,30,40,50,60,70,80,90,100
for i in range(0,100,10):
    var = frame_with_durations_modified["total_amount"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print("100 percentile value is ",var[-1])
0 percentile value is -242.55
10 percentile value is 6.3
20 percentile value is 7.8
30 percentile value is 8.8
40 percentile value is 9.8
50 percentile value is 11.16
60 percentile value is 12.8
70 percentile value is 14.8
80 percentile value is 18.3
90 percentile value is 25.8
100 percentile value is  3950611.6
In [34]:
# calculating total fare amount values at each percentile: 90,91,92,93,94,95,96,97,98,99,100
for i in range(90,100):
    var = frame_with_durations_modified["total_amount"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(i,var[int(len(var)*(float(i)/100))]))
print("100 percentile value is ",var[-1])
90 percentile value is 25.8
91 percentile value is 27.3
92 percentile value is 29.3
93 percentile value is 31.8
94 percentile value is 34.8
95 percentile value is 38.53
96 percentile value is 42.6
97 percentile value is 48.13
98 percentile value is 58.13
99 percentile value is 66.13
100 percentile value is  3950611.6
In [35]:
# calculating total fare amount values at each percentile: 99.0,99.1,99.2,99.3,99.4,99.5,99.6,99.7,99.8,99.9,100
for i in np.arange(0.0, 1.0, 0.1):
    var = frame_with_durations_modified["total_amount"].values
    var = np.sort(var,axis = None)
    print("{} percentile value is {}".format(99+i,var[int(len(var)*(float(99+i)/100))]))
print("100 percentile value is ",var[-1])
99.0 percentile value is 66.13
99.1 percentile value is 68.13
99.2 percentile value is 69.6
99.3 percentile value is 69.6
99.4 percentile value is 69.73
99.5 percentile value is 69.75
99.6 percentile value is 69.76
99.7 percentile value is 72.58
99.8 percentile value is 75.35
99.9 percentile value is 88.28
100 percentile value is  3950611.6

Observation:- Even the 99.9th percentile value does not look like an outlier, as there is not much difference between the 99.8th and 99.9th percentile values, so we move on to graphical analysis.

In [36]:
# the plot below shows the sorted fare values; we look for a sharp increase so that those values can be removed as outliers
# plot the fare amounts excluding the last two values in the sorted data
plt.plot(var[:-2])
plt.show()
In [37]:
# a very sharp increase in fare values can be seen
# plotting the last three total fare values, we can observe a sharp increase in the values
plt.plot(var[-3:])
plt.show()
In [38]:
# now, looking at the values excluding the last two points, we again find a drastic increase at around a fare value of 1000
# we plot the last 50 values excluding the last two values
plt.plot(var[-50:-2])
plt.show()

Remove all outliers/erroneous points.

In [39]:
#removing all outliers based on our univariate analysis above
def remove_outliers(new_frame):

    
    a = new_frame.shape[0]
    print ("Number of pickup records = ",a)
    temp_frame = new_frame[((new_frame.dropoff_longitude >= -74.15) & (new_frame.dropoff_longitude <= -73.7004) &\
                       (new_frame.dropoff_latitude >= 40.5774) & (new_frame.dropoff_latitude <= 40.9176)) & \
                       ((new_frame.pickup_longitude >= -74.15) & (new_frame.pickup_latitude >= 40.5774)& \
                       (new_frame.pickup_longitude <= -73.7004) & (new_frame.pickup_latitude <= 40.9176))]
    b = temp_frame.shape[0]
    print ("Number of outlier coordinates lying outside NY boundaries:",(a-b))

    
    temp_frame = new_frame[(new_frame.trip_times > 0) & (new_frame.trip_times < 720)]
    c = temp_frame.shape[0]
    print ("Number of outliers from trip times analysis:",(a-c))
    
    
    temp_frame = new_frame[(new_frame.trip_distance > 0) & (new_frame.trip_distance < 23)]
    d = temp_frame.shape[0]
    print ("Number of outliers from trip distance analysis:",(a-d))
    
    temp_frame = new_frame[(new_frame.Speed <= 65) & (new_frame.Speed >= 0)]
    e = temp_frame.shape[0]
    print ("Number of outliers from speed analysis:",(a-e))
    
    temp_frame = new_frame[(new_frame.total_amount <1000) & (new_frame.total_amount >0)]
    f = temp_frame.shape[0]
    print ("Number of outliers from fare analysis:",(a-f))
    
    
    new_frame = new_frame[((new_frame.dropoff_longitude >= -74.15) & (new_frame.dropoff_longitude <= -73.7004) &\
                       (new_frame.dropoff_latitude >= 40.5774) & (new_frame.dropoff_latitude <= 40.9176)) & \
                       ((new_frame.pickup_longitude >= -74.15) & (new_frame.pickup_latitude >= 40.5774)& \
                       (new_frame.pickup_longitude <= -73.7004) & (new_frame.pickup_latitude <= 40.9176))]
    
    new_frame = new_frame[(new_frame.trip_times > 0) & (new_frame.trip_times < 720)]
    new_frame = new_frame[(new_frame.trip_distance > 0) & (new_frame.trip_distance < 23)]
    new_frame = new_frame[(new_frame.Speed < 45.31) & (new_frame.Speed > 0)]
    new_frame = new_frame[(new_frame.total_amount <1000) & (new_frame.total_amount >0)]
    
    print ("Total outliers removed",a - new_frame.shape[0])
    print ("---")
    return new_frame
In [40]:
print ("Removing outliers in the month of Jan-2015")
print ("----")
frame_with_durations_outliers_removed = remove_outliers(frame_with_durations)
print("fraction of data points that remain after removing outliers", float(len(frame_with_durations_outliers_removed))/len(frame_with_durations))
Removing outliers in the month of Jan-2015
----
Number of pickup records =  12748986
Number of outlier coordinates lying outside NY boundaries: 293919
Number of outliers from trip times analysis: 23889
Number of outliers from trip distance analysis: 92597
Number of outliers from speed analysis: 24473
Number of outliers from fare analysis: 5275
Total outliers removed 377910
---
fraction of data points that remain after removing outliers 0.9703576425607495

Data Preparation

Clustering/Segmentation

In [41]:
#trying different cluster sizes to choose the right K in K-means
coords = frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']].values
neighbours=[]

def find_min_distance(cluster_centers, cluster_len):
    nice_points = 0
    wrong_points = 0
    less2 = []
    more2 = []
    min_dist=1000
    for i in range(0, cluster_len):
        nice_points = 0
        wrong_points = 0
        for j in range(0, cluster_len):
            if j!=i:
                distance = gpxpy.geo.haversine_distance(cluster_centers[i][0], cluster_centers[i][1],cluster_centers[j][0], cluster_centers[j][1])
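                # haversine_distance returns the distance in metres; dividing by (1.60934*1000) converts metres to miles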
                min_dist = min(min_dist,distance/(1.60934*1000))
                if (distance/(1.60934*1000)) <= 2:
                    nice_points +=1
                else:
                    wrong_points += 1
        less2.append(nice_points)
        more2.append(wrong_points)
    neighbours.append(less2)
    print ("On choosing a cluster size of ",cluster_len,"\nAvg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2):", np.ceil(sum(less2)/len(less2)), "\nAvg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2):", np.ceil(sum(more2)/len(more2)),"\nMin inter-cluster distance = ",min_dist,"\n---")

def find_clusters(increment):
    kmeans = MiniBatchKMeans(n_clusters=increment, batch_size=10000,random_state=42).fit(coords)
    frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
    cluster_centers = kmeans.cluster_centers_
    cluster_len = len(cluster_centers)
    return cluster_centers, cluster_len

# we need to choose the number of clusters such that more cluster regions are close to some cluster center,
# while making sure that the minimum inter-cluster distance is not too small
for increment in range(10, 100, 10):
    cluster_centers, cluster_len = find_clusters(increment)
    find_min_distance(cluster_centers, cluster_len)            
On choosing a cluster size of  10 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 2.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 8.0 
Min inter-cluster distance =  1.0945442325142543 
---
On choosing a cluster size of  20 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 4.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 16.0 
Min inter-cluster distance =  0.7131298007387813 
---
On choosing a cluster size of  30 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 8.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 22.0 
Min inter-cluster distance =  0.5185088176172206 
---
On choosing a cluster size of  40 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 8.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 32.0 
Min inter-cluster distance =  0.5069768450363973 
---
On choosing a cluster size of  50 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 12.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 38.0 
Min inter-cluster distance =  0.365363025983595 
---
On choosing a cluster size of  60 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 14.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 46.0 
Min inter-cluster distance =  0.34704283494187155 
---
On choosing a cluster size of  70 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 16.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 54.0 
Min inter-cluster distance =  0.30502203163244707 
---
On choosing a cluster size of  80 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 18.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 62.0 
Min inter-cluster distance =  0.29220324531738534 
---
On choosing a cluster size of  90 
Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 21.0 
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 69.0 
Min inter-cluster distance =  0.18257992857034985 
---

Inference:

  • The main objective was to find an optimal minimum inter-cluster distance (which roughly corresponds to the radius of a cluster); based on this, we settled on 40 clusters
In [42]:
# if you check the 50-cluster case you can observe that there are two clusters only about 0.3 miles apart from each other
# so we choose 40 clusters to solve the problem going forward

# Getting 40 clusters using the kmeans 
kmeans = MiniBatchKMeans(n_clusters=40, batch_size=10000,random_state=0).fit(coords)
frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])

Plotting the cluster centers:

In [43]:
# Plotting the cluster centers on OSM
cluster_centers = kmeans.cluster_centers_
cluster_len = len(cluster_centers)
map_osm = folium.Map(location=[40.734695, -73.990372], tiles='Stamen Toner')
for i in range(cluster_len):
    folium.Marker(list((cluster_centers[i][0],cluster_centers[i][1])), popup=(str(cluster_centers[i][0])+str(cluster_centers[i][1]))).add_to(map_osm)
map_osm
Out[43]:

Plotting the clusters:

In [44]:
#Visualising the clusters on a map
def plot_clusters(frame):
    city_long_border = (-74.03, -73.75)
    city_lat_border = (40.63, 40.85)
    fig, ax = plt.subplots(ncols=1, nrows=1)
    ax.scatter(frame.pickup_longitude.values[:100000], frame.pickup_latitude.values[:100000], s=10, lw=0,
               c=frame.pickup_cluster.values[:100000], cmap='tab20', alpha=0.2)
    ax.set_xlim(city_long_border)
    ax.set_ylim(city_lat_border)
    ax.set_xlabel('Longitude')
    ax.set_ylabel('Latitude')
    plt.show()

plot_clusters(frame_with_durations_outliers_removed)

Time-binning

In [45]:
#Refer:https://www.unixtimestamp.com/
# 1420070400 : 2015-01-01 00:00:00 
# 1422748800 : 2015-02-01 00:00:00 
# 1425168000 : 2015-03-01 00:00:00
# 1427846400 : 2015-04-01 00:00:00 
# 1430438400 : 2015-05-01 00:00:00 
# 1433116800 : 2015-06-01 00:00:00

# 1451606400 : 2016-01-01 00:00:00 
# 1454284800 : 2016-02-01 00:00:00 
# 1456790400 : 2016-03-01 00:00:00
# 1459468800 : 2016-04-01 00:00:00 
# 1462060800 : 2016-05-01 00:00:00 
# 1464739200 : 2016-06-01 00:00:00

def add_pickup_bins(frame,month,year):
    unix_pickup_times=[i for i in frame['pickup_times'].values]
    unix_times = [[1420070400,1422748800,1425168000,1427846400,1430438400,1433116800],\
                    [1451606400,1454284800,1456790400,1459468800,1462060800,1464739200]]
    
    start_pickup_unix=unix_times[year-2015][month-1]
    # https://www.timeanddate.com/time/zones/est
    # (int((i-start_pickup_unix)/600)+33) : our unix time is in GMT, so we shift the bin index to convert it to EST
    tenminutewise_binned_unix_pickup_times=[(int((i-start_pickup_unix)/600)+33) for i in unix_pickup_times]
    frame['pickup_bins'] = np.array(tenminutewise_binned_unix_pickup_times)
    return frame
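A quick worked example of the binning above: a pickup whose unix time is 3600 seconds (one hour) after the month's start falls into bin int(3600/600) + 33 = 6 + 33 = 39, so every trip gets an integer 10-minute-interval index within its month.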
In [46]:
# clustering, making pickup bins and grouping by pickup cluster and pickup bins
frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
jan_2015_frame = add_pickup_bins(frame_with_durations_outliers_removed,1,2015)
jan_2015_groupby = jan_2015_frame[['pickup_cluster','pickup_bins','trip_distance']].groupby(['pickup_cluster','pickup_bins']).count()
In [47]:
# we add two more columns: 'pickup_cluster' (the cluster the pickup belongs to)
# and 'pickup_bins' (the 10-min interval the trip belongs to)
jan_2015_frame.head()
Out[47]:
passenger_count trip_distance pickup_longitude pickup_latitude dropoff_longitude dropoff_latitude total_amount trip_times pickup_times Speed pickup_cluster pickup_bins
0 1 1.59 -73.993896 40.750111 -73.974785 40.750618 17.05 18.050000 1.421329e+09 5.285319 34 2130
1 1 3.30 -74.001648 40.724243 -73.994415 40.759109 17.80 19.833333 1.420902e+09 9.983193 2 1419
2 1 1.80 -73.963341 40.802788 -73.951820 40.824413 10.80 10.050000 1.420902e+09 10.746269 16 1419
3 1 0.50 -74.009087 40.713818 -74.004326 40.719986 4.80 1.866667 1.420902e+09 16.071429 38 1419
4 1 3.00 -73.971176 40.762428 -74.004181 40.742653 16.30 19.316667 1.420902e+09 9.318378 22 1419
In [48]:
# here the trip_distance column represents the number of pickups that happened in that particular 10-min interval
# this data frame has two indices
# primary index: pickup_cluster (cluster number)
# secondary index: pickup_bins (we divide the whole month's time into 10-min intervals, 24*31*60/10 = 4464 bins)
jan_2015_groupby.head()
Out[48]:
trip_distance
pickup_cluster pickup_bins
0 1 105
2 199
3 208
4 141
5 155
In [49]:
# up to now we have cleaned and prepared the data for Jan 2015

# now do the same operations for the months Jan, Feb, March of 2016
# 1. get the dataframe which includes only the required columns
# 2. add trip times, speed, and the unix timestamp of the pickup time
# 3. remove the outliers based on trip_times, speed, trip_distance, total_amount
# 4. add pickup_cluster to each data point
# 5. add pickup_bin (index of the 10-min interval to which that trip belongs)
# 6. group the data by 'pickup_cluster' and 'pickup_bin'

# Data Preparation for the months of Jan,Feb and March 2016
def datapreparation(month,kmeans,month_no,year_no):
    
    print ("Return with trip times..")

    frame_with_durations = return_with_trip_times(month)
    
    print ("Remove outliers..")
    frame_with_durations_outliers_removed = remove_outliers(frame_with_durations)
    
    print ("Estimating clusters..")
    frame_with_durations_outliers_removed['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed[['pickup_latitude', 'pickup_longitude']])
    #frame_with_durations_outliers_removed_2016['pickup_cluster'] = kmeans.predict(frame_with_durations_outliers_removed_2016[['pickup_latitude', 'pickup_longitude']])

    print ("Final groupbying..")
    final_updated_frame = add_pickup_bins(frame_with_durations_outliers_removed,month_no,year_no)
    final_groupby_frame = final_updated_frame[['pickup_cluster','pickup_bins','trip_distance']].groupby(['pickup_cluster','pickup_bins']).count()
    
    return final_updated_frame,final_groupby_frame
    
month_jan_2016 = dd.read_csv('yellow_tripdata_2016-01.csv')
month_feb_2016 = dd.read_csv('yellow_tripdata_2016-02.csv')
month_mar_2016 = dd.read_csv('yellow_tripdata_2016-03.csv')

jan_2016_frame,jan_2016_groupby = datapreparation(month_jan_2016,kmeans,1,2016)
feb_2016_frame,feb_2016_groupby = datapreparation(month_feb_2016,kmeans,2,2016)
mar_2016_frame,mar_2016_groupby = datapreparation(month_mar_2016,kmeans,3,2016)
Return with trip times..
Remove outliers..
Number of pickup records =  10906858
Number of outlier coordinates lying outside NY boundaries: 214677
Number of outliers from trip times analysis: 27190
Number of outliers from trip distance analysis: 79742
Number of outliers from speed analysis: 21047
Number of outliers from fare analysis: 4991
Total outliers removed 297784
---
Estimating clusters..
Final groupbying..
Return with trip times..
Remove outliers..
Number of pickup records =  11382049
Number of outlier coordinates lying outside NY boundaries: 223161
Number of outliers from trip times analysis: 27670
Number of outliers from trip distance analysis: 81902
Number of outliers from speed analysis: 22437
Number of outliers from fare analysis: 5476
Total outliers removed 308177
---
Estimating clusters..
Final groupbying..
Return with trip times..
Remove outliers..
Number of pickup records =  12210952
Number of outlier coordinates lying outside NY boundaries: 232444
Number of outliers from trip times analysis: 30868
Number of outliers from trip distance analysis: 87318
Number of outliers from speed analysis: 23889
Number of outliers from fare analysis: 5859
Total outliers removed 324635
---
Estimating clusters..
Final groupbying..

Smoothing

In [50]:
# Gets the unique bins where pickup values are present for each region

# for each cluster region we collect all the indices of the 10-min intervals in which pickups happened
# we observed that there are some pickup bins that don't have any pickups
def return_unq_pickup_bins(frame):
    values = []
    for i in range(0,40):
        new = frame[frame['pickup_cluster'] == i]
        list_unq = list(set(new['pickup_bins']))
        list_unq.sort()
        values.append(list_unq)
    return values
In [51]:
# for every month we get all indices of the 10-min intervals in which at least one pickup happened

#jan
jan_2015_unique = return_unq_pickup_bins(jan_2015_frame)
jan_2016_unique = return_unq_pickup_bins(jan_2016_frame)

#feb
feb_2016_unique = return_unq_pickup_bins(feb_2016_frame)

#march
mar_2016_unique = return_unq_pickup_bins(mar_2016_frame)
In [52]:
# for each cluster, the number of 10-min intervals with 0 pickups
for i in range(40):
    print("for the ",i,"th cluster number of 10min intavels with zero pickups: ",4464 - len(set(jan_2015_unique[i])))
    print('-'*60)
for the  0 th cluster number of 10min intavels with zero pickups:  41
------------------------------------------------------------
for the  1 th cluster number of 10min intavels with zero pickups:  1986
------------------------------------------------------------
for the  2 th cluster number of 10min intavels with zero pickups:  30
------------------------------------------------------------
for the  3 th cluster number of 10min intavels with zero pickups:  355
------------------------------------------------------------
for the  4 th cluster number of 10min intavels with zero pickups:  38
------------------------------------------------------------
for the  5 th cluster number of 10min intavels with zero pickups:  154
------------------------------------------------------------
for the  6 th cluster number of 10min intavels with zero pickups:  35
------------------------------------------------------------
for the  7 th cluster number of 10min intavels with zero pickups:  34
------------------------------------------------------------
for the  8 th cluster number of 10min intavels with zero pickups:  118
------------------------------------------------------------
for the  9 th cluster number of 10min intavels with zero pickups:  41
------------------------------------------------------------
for the  10 th cluster number of 10min intavels with zero pickups:  26
------------------------------------------------------------
for the  11 th cluster number of 10min intavels with zero pickups:  45
------------------------------------------------------------
for the  12 th cluster number of 10min intavels with zero pickups:  43
------------------------------------------------------------
for the  13 th cluster number of 10min intavels with zero pickups:  29
------------------------------------------------------------
for the  14 th cluster number of 10min intavels with zero pickups:  27
------------------------------------------------------------
for the  15 th cluster number of 10min intavels with zero pickups:  32
------------------------------------------------------------
for the  16 th cluster number of 10min intavels with zero pickups:  41
------------------------------------------------------------
for the  17 th cluster number of 10min intavels with zero pickups:  59
------------------------------------------------------------
for the  18 th cluster number of 10min intavels with zero pickups:  1191
------------------------------------------------------------
for the  19 th cluster number of 10min intavels with zero pickups:  1358
------------------------------------------------------------
for the  20 th cluster number of 10min intavels with zero pickups:  54
------------------------------------------------------------
for the  21 th cluster number of 10min intavels with zero pickups:  30
------------------------------------------------------------
for the  22 th cluster number of 10min intavels with zero pickups:  30
------------------------------------------------------------
for the  23 th cluster number of 10min intavels with zero pickups:  164
------------------------------------------------------------
for the  24 th cluster number of 10min intavels with zero pickups:  36
------------------------------------------------------------
for the  25 th cluster number of 10min intavels with zero pickups:  42
------------------------------------------------------------
for the  26 th cluster number of 10min intavels with zero pickups:  32
------------------------------------------------------------
for the  27 th cluster number of 10min intavels with zero pickups:  215
------------------------------------------------------------
for the  28 th cluster number of 10min intavels with zero pickups:  37
------------------------------------------------------------
for the  29 th cluster number of 10min intavels with zero pickups:  42
------------------------------------------------------------
for the  30 th cluster number of 10min intavels with zero pickups:  1181
------------------------------------------------------------
for the  31 th cluster number of 10min intavels with zero pickups:  43
------------------------------------------------------------
for the  32 th cluster number of 10min intavels with zero pickups:  45
------------------------------------------------------------
for the  33 th cluster number of 10min intavels with zero pickups:  44
------------------------------------------------------------
for the  34 th cluster number of 10min intavels with zero pickups:  40
------------------------------------------------------------
for the  35 th cluster number of 10min intavels with zero pickups:  43
------------------------------------------------------------
for the  36 th cluster number of 10min intavels with zero pickups:  37
------------------------------------------------------------
for the  37 th cluster number of 10min intavels with zero pickups:  322
------------------------------------------------------------
for the  38 th cluster number of 10min intavels with zero pickups:  37
------------------------------------------------------------
for the  39 th cluster number of 10min intavels with zero pickups:  44
------------------------------------------------------------

There are two ways to fill up these missing values:

  • Fill the missing values with 0's
  • Fill the missing values with the average of the neighbouring known values (see the sketch after this list)
    • Case 1: (values missing at the start)
      Ex1: _ _ _ x => ceil(x/4), ceil(x/4), ceil(x/4), ceil(x/4)
      Ex2: _ _ x => ceil(x/3), ceil(x/3), ceil(x/3)
    • Case 2: (values missing in the middle)
      Ex1: x _ _ y => ceil((x+y)/4), ceil((x+y)/4), ceil((x+y)/4), ceil((x+y)/4)
      Ex2: x _ _ _ y => ceil((x+y)/5), ceil((x+y)/5), ceil((x+y)/5), ceil((x+y)/5), ceil((x+y)/5)
    • Case 3: (values missing at the end)
      Ex1: x _ _ _ => ceil(x/4), ceil(x/4), ceil(x/4), ceil(x/4)
      Ex2: x _ => ceil(x/2), ceil(x/2)
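A minimal sketch of the middle-gap rule (Case 2 above) on a toy series; this is only an illustration of the averaging idea, separate from the smoothing() function defined below:

import math

toy = [10, 0, 0, 0, 20]                  # 10 pickups, three empty bins, then 20 pickups
left, right = toy[0], toy[-1]
slots = len(toy)                         # left value + 3 missing bins + right value = 5 slots
smoothed = [math.ceil((left + right) / slots)] * slots
print(smoothed)                          # [6, 6, 6, 6, 6] -> the 30 pickups are spread evenly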
In [53]:
# Fills a value of zero for every bin where no pickup data is present
# count_values : the number of pickups that happened in each region for each 10-min interval
# (there won't be an entry if there are no pickups)
# values : the unique (non-empty) bins for each region

# for every 10-min interval (pickup_bin) we check whether it is among the unique bins;
# if it is, we append count_values[index] to the smoothed data,
# otherwise we append 0 to the smoothed data
# we finally return the smoothed data
def fill_missing(count_values,values):
    smoothed_regions=[]
    ind=0
    for r in range(0,40):
        smoothed_bins=[]
        for i in range(4464):
            if i in values[r]:
                smoothed_bins.append(count_values[ind])
                ind+=1
            else:
                smoothed_bins.append(0)
        smoothed_regions.extend(smoothed_bins)
    return smoothed_regions
In [54]:
# Fills an averaged (smoothed) value for every bin where no pickup data is present
# count_values : the number of pickups that happened in each region for each 10-min interval
# (there won't be an entry if there are no pickups)
# values : the unique (non-empty) bins for each region

# for every 10-min interval (pickup_bin) we check whether it is among the unique bins;
# if it is, we append count_values[index] to the smoothed data,
# otherwise we append a smoothed value (calculated using the methods discussed in the markdown cell above)
# we finally return the smoothed data
def smoothing(count_values,values):
    smoothed_regions=[] # stores list of final smoothed values of each region
    ind=0
    repeat=0 
    smoothed_value=0
    for r in range(0,40):
        smoothed_bins=[] #stores the final smoothed values
        repeat=0
        for i in range(4464):
            if repeat!=0: # prevents iteration for a value which is already visited/resolved
                repeat-=1
                continue
            if i in values[r]: #checks if the pickup-bin exists 
                smoothed_bins.append(count_values[ind]) # appends the value of the pickup bin if it exists
            else:
                if i!=0:
                    right_hand_limit=0
                    for j in range(i,4464):
                        if  j not in values[r]: # searches for the right-limit, i.e. the next pickup-bin which has a pickup value
                            continue
                        else:
                            right_hand_limit=j
                            break
                    if right_hand_limit==0:
                    # Case 3: when the last / last few values are missing, hence we have no right-limit here
                        smoothed_value=count_values[ind-1]*1.0/((4463-i)+2)*1.0                               
                        for j in range(i,4464):                              
                            smoothed_bins.append(math.ceil(smoothed_value))
                        smoothed_bins[i-1] = math.ceil(smoothed_value)
                        repeat=(4463-i)
                        ind-=1
                    else:
                    # Case 2: when the missing values are between two known values
                        smoothed_value=(count_values[ind-1]+count_values[ind])*1.0/((right_hand_limit-i)+2)*1.0             
                        for j in range(i,right_hand_limit+1):
                            smoothed_bins.append(math.ceil(smoothed_value))
                        smoothed_bins[i-1] = math.ceil(smoothed_value)
                        repeat=(right_hand_limit-i)
                else:
                    # Case 1: when the first / first few values are missing, hence we have no left-limit here
                    right_hand_limit=0
                    for j in range(i,4464):
                        if  j not in values[r]:
                            continue
                        else:
                            right_hand_limit=j
                            break
                    smoothed_value=count_values[ind]*1.0/((right_hand_limit-i)+1)*1.0
                    for j in range(i,right_hand_limit+1):
                            smoothed_bins.append(math.ceil(smoothed_value))
                    repeat=(right_hand_limit-i)
            ind+=1
        smoothed_regions.extend(smoothed_bins)
    return smoothed_regions
In [55]:
# Filling missing values of Jan-2015 with 0
# here, in the jan_2015_groupby dataframe, the trip_distance column represents the number of pickups that happened
jan_2015_fill = fill_missing(jan_2015_groupby['trip_distance'].values,jan_2015_unique)

#Smoothing Missing values of Jan-2015
jan_2015_smooth = smoothing(jan_2015_groupby['trip_distance'].values,jan_2015_unique)
In [56]:
# number of 10min indices for jan 2015= 24*31*60/10 = 4464
# number of 10min indices for jan 2016 = 24*31*60/10 = 4464
# number of 10min indices for feb 2016 = 24*29*60/10 = 4176
# number of 10min indices for march 2016 = 24*31*60/10 = 4464
# for each cluster we will have 4464 values, therefore 40*4464 = 178560 (length of the jan_2015_fill)
print("number of 10min intravels among all the clusters ",len(jan_2015_fill))
number of 10min intravels among all the clusters  178560
In [57]:
# Smoothing vs Filling
# sample plot that shows two variations of filling missing values
# we have taken the number of pickups for cluster region 2
plt.figure(figsize=(10,5))
plt.plot(jan_2015_fill[4464:8920], label="zero filled values")
plt.plot(jan_2015_smooth[4464:8920], label="filled with avg values")
plt.legend()
plt.show()
In [58]:
# why did we choose these methods, and which method is used for which data?

# Ans: suppose for some month we have data like 10 _ _ _ 20, i.e. 10 pickups happened in the 1st
# 10-min interval, 0 pickups in the 2nd, 3rd and 4th 10-min intervals,
# and 20 pickups in the 5th 10-min interval.
# with the fill_missing method we fill these values as 10, 0, 0, 0, 20,
# whereas with the smoothing method we fill them as 6, 6, 6, 6, 6; the total number of pickups
# in the first 50 minutes is the same in both cases, but smoothing looks at future values,
# which would cause data leakage if applied to the data we are trying to predict.

# so we use smoothing for the Jan 2015 data, since it only acts as our training data,
# and we use the simple fill_missing method for the 2016 data.
In [59]:
# Jan-2015 data is smoothed, Jan,Feb & March 2016 data missing values are filled with zero
jan_2015_smooth = smoothing(jan_2015_groupby['trip_distance'].values,jan_2015_unique)
jan_2016_smooth = fill_missing(jan_2016_groupby['trip_distance'].values,jan_2016_unique)
feb_2016_smooth = fill_missing(feb_2016_groupby['trip_distance'].values,feb_2016_unique)
mar_2016_smooth = fill_missing(mar_2016_groupby['trip_distance'].values,mar_2016_unique)

# Making list of all the values of pickup data in every bin for a period of 3 months and storing them region-wise 
regions_cum = []

# a =[1,2,3]
# b = [2,3,4]
# a+b = [1, 2, 3, 2, 3, 4]

# number of 10min indices for jan 2015= 24*31*60/10 = 4464
# number of 10min indices for jan 2016 = 24*31*60/10 = 4464
# number of 10min indices for feb 2016 = 24*29*60/10 = 4176
# number of 10min indices for march 2016 = 24*31*60/10 = 4464
# regions_cum: it will contain 40 lists, each list will contain 4464+4176+4464 values which represent the number of pickups 
# that happened over the three months of 2016

for i in range(0,40):
    regions_cum.append(jan_2016_smooth[4464*i:4464*(i+1)]+feb_2016_smooth[4176*i:4176*(i+1)]+mar_2016_smooth[4464*i:4464*(i+1)])

# print(len(regions_cum))
# 40
# print(len(regions_cum[0]))
# 13104

Time series and Fourier Transforms

In [60]:
def uniqueish_color():
    """There're better ways to generate unique colors, but this isn't awful."""
    return plt.cm.gist_ncar(np.random.random())
first_x = list(range(0,4464))
second_x = list(range(4464,8640))
third_x = list(range(8640,13104))
for i in range(40):
    plt.figure(figsize=(10,4))
    plt.plot(first_x,regions_cum[i][:4464], color=uniqueish_color(), label='2016 Jan month data')
    plt.plot(second_x,regions_cum[i][4464:8640], color=uniqueish_color(), label='2016 feb month data')
    plt.plot(third_x,regions_cum[i][8640:], color=uniqueish_color(), label='2016 march month data')
    plt.legend()
    plt.show()
In [61]:
# getting peaks: https://blog.ytotech.com/2015/11/01/findpeaks-in-python/
# read more about fft function : https://docs.scipy.org/doc/numpy/reference/generated/numpy.fft.fft.html
Y    = np.fft.fft(np.array(jan_2016_smooth)[0:4460])
# read more about the fftfreq: https://docs.scipy.org/doc/numpy/reference/generated/numpy.fft.fftfreq.html  
freq = np.fft.fftfreq(4460, 1)
n = len(freq)
print(n)
plt.figure()
plt.plot( freq[:int(n/2)], np.abs(Y)[:int(n/2)] )
plt.xlabel("Frequency")
plt.ylabel("Amplitude")
plt.show()
4460
In [62]:
#Preparing the Dataframe only with x(i) values as jan-2015 data and y(i) values as jan-2016
ratios_jan = pd.DataFrame()
ratios_jan['Given']=jan_2015_smooth
ratios_jan['Prediction']=jan_2016_smooth
ratios_jan['Ratios']=ratios_jan['Prediction']*1.0/ratios_jan['Given']*1.0
In [177]:
#print(ratios_jan.head())
print(ratios_jan.shape)
(178560, 15)

Modelling: Baseline Models

Now we move on to modelling, in order to forecast the pickup densities for the months of Jan, Feb and March of 2016. We use multiple models, with two variations:

  1. Using Ratios of the 2016 data to the 2015 data i.e $\begin{align} R_{t} = P^{2016}_{t} / P^{2015}_{t} \end{align}$
  2. Using Previous known values of the 2016 data itself to predict the future values

Simple Moving Averages

The first model used is the Moving Averages model, which uses the previous n values in order to predict the next value.

Using Ratio Values - $\begin{align}R_{t} = ( R_{t-1} + R_{t-2} + R_{t-3} .... R_{t-n} )/n \end{align}$

In [63]:
def MA_R_Predictions(ratios,month):
    predicted_ratio=(ratios['Ratios'].values)[0]
    error=[]
    predicted_values=[]
    window_size=3
    predicted_ratio_values=[]
    for i in range(0,4464*40):
        if i%4464==0:
            predicted_ratio_values.append(0)
            predicted_values.append(0)
            error.append(0)
            continue
        predicted_ratio_values.append(predicted_ratio)
        predicted_values.append(int(((ratios['Given'].values)[i])*predicted_ratio))
        error.append(abs((math.pow(int(((ratios['Given'].values)[i])*predicted_ratio)-(ratios['Prediction'].values)[i],1))))
        if i+1>=window_size:
            predicted_ratio=sum((ratios['Ratios'].values)[(i+1)-window_size:(i+1)])/window_size
        else:
            predicted_ratio=sum((ratios['Ratios'].values)[0:(i+1)])/(i+1)
            
    
    ratios['MA_R_Predicted'] = predicted_values
    ratios['MA_R_Error'] = error
    mape_err = (sum(error)/len(error))/(sum(ratios['Prediction'].values)/len(ratios['Prediction'].values))
    mse_err = sum([e**2 for e in error])/len(error)
    return ratios,mape_err,mse_err

For the above, the hyperparameter is the window size (n), which is tuned manually; a window size of 3 was found to give the best results for Moving Averages on the previous ratio values, therefore we get $\begin{align}R_{t} = ( R_{t-1} + R_{t-2} + R_{t-3})/3 \end{align}$

Next we use the Moving Averages of the 2016 values themselves to predict the future value, using $\begin{align}P_{t} = ( P_{t-1} + P_{t-2} + P_{t-3} .... P_{t-n} )/n \end{align}$

In [64]:
def MA_P_Predictions(ratios,month):
    predicted_value=(ratios['Prediction'].values)[0]
    error=[]
    predicted_values=[]
    window_size=1
    predicted_ratio_values=[]
    for i in range(0,4464*40):
        predicted_values.append(predicted_value)
        error.append(abs((math.pow(predicted_value-(ratios['Prediction'].values)[i],1))))
        if i+1>=window_size:
            predicted_value=int(sum((ratios['Prediction'].values)[(i+1)-window_size:(i+1)])/window_size)
        else:
            predicted_value=int(sum((ratios['Prediction'].values)[0:(i+1)])/(i+1))
            
    ratios['MA_P_Predicted'] = predicted_values
    ratios['MA_P_Error'] = error
    mape_err = (sum(error)/len(error))/(sum(ratios['Prediction'].values)/len(ratios['Prediction'].values))
    mse_err = sum([e**2 for e in error])/len(error)
    return ratios,mape_err,mse_err

For the above, the hyperparameter is the window size (n), which is tuned manually; a window size of 1 was found to give the best results for Moving Averages on the previous 2016 values, therefore we get $\begin{align}P_{t} = P_{t-1} \end{align}$

Weighted Moving Averages

The Moving Averages model gave equal importance to all the values in the window, but intuitively the future is more likely to resemble the latest values than the older ones. Weighted Moving Averages turns this intuition into a mathematical relationship, assigning the highest weight to the most recent value and decreasing weights to the subsequent older ones.

Weighted Moving Averages using Ratio Values - $\begin{align}R_{t} = ( N*R_{t-1} + (N-1)*R_{t-2} + (N-2)*R_{t-3} .... 1*R_{t-n} )/(N*(N+1)/2) \end{align}$

In [65]:
def WA_R_Predictions(ratios,month):
    predicted_ratio=(ratios['Ratios'].values)[0]
    alpha=0.5
    error=[]
    predicted_values=[]
    window_size=5
    predicted_ratio_values=[]
    for i in range(0,4464*40):
        if i%4464==0:
            predicted_ratio_values.append(0)
            predicted_values.append(0)
            error.append(0)
            continue
        predicted_ratio_values.append(predicted_ratio)
        predicted_values.append(int(((ratios['Given'].values)[i])*predicted_ratio))
        error.append(abs((math.pow(int(((ratios['Given'].values)[i])*predicted_ratio)-(ratios['Prediction'].values)[i],1))))
        if i+1>=window_size:
            sum_values=0
            sum_of_coeff=0
            for j in range(window_size,0,-1):
                sum_values += j*(ratios['Ratios'].values)[i-window_size+j]
                sum_of_coeff+=j
            predicted_ratio=sum_values/sum_of_coeff
        else:
            sum_values=0
            sum_of_coeff=0
            for j in range(i+1,0,-1):
                sum_values += j*(ratios['Ratios'].values)[j-1]
                sum_of_coeff+=j
            predicted_ratio=sum_values/sum_of_coeff
            
    ratios['WA_R_Predicted'] = predicted_values
    ratios['WA_R_Error'] = error
    mape_err = (sum(error)/len(error))/(sum(ratios['Prediction'].values)/len(ratios['Prediction'].values))
    mse_err = sum([e**2 for e in error])/len(error)
    return ratios,mape_err,mse_err

For the above, the hyperparameter is the window size (n), which is tuned manually; a window size of 5 was found to give the best results for Weighted Moving Averages on the previous ratio values, therefore we get $\begin{align} R_{t} = ( 5*R_{t-1} + 4*R_{t-2} + 3*R_{t-3} + 2*R_{t-4} + R_{t-5} )/15 \end{align}$

Weighted Moving Averages using Previous 2016 Values - $\begin{align}P_{t} = ( N*P_{t-1} + (N-1)*P_{t-2} + (N-2)*P_{t-3} .... 1*P_{t-n} )/(N*(N+1)/2) \end{align}$

In [66]:
def WA_P_Predictions(ratios,month):
    predicted_value=(ratios['Prediction'].values)[0]
    error=[]
    predicted_values=[]
    window_size=2
    for i in range(0,4464*40):
        predicted_values.append(predicted_value)
        error.append(abs((math.pow(predicted_value-(ratios['Prediction'].values)[i],1))))
        if i+1>=window_size:
            sum_values=0
            sum_of_coeff=0
            for j in range(window_size,0,-1):
                sum_values += j*(ratios['Prediction'].values)[i-window_size+j]
                sum_of_coeff+=j
            predicted_value=int(sum_values/sum_of_coeff)

        else:
            sum_values=0
            sum_of_coeff=0
            for j in range(i+1,0,-1):
                sum_values += j*(ratios['Prediction'].values)[j-1]
                sum_of_coeff+=j
            predicted_value=int(sum_values/sum_of_coeff)
    
    ratios['WA_P_Predicted'] = predicted_values
    ratios['WA_P_Error'] = error
    mape_err = (sum(error)/len(error))/(sum(ratios['Prediction'].values)/len(ratios['Prediction'].values))
    mse_err = sum([e**2 for e in error])/len(error)
    return ratios,mape_err,mse_err

For the above, the hyperparameter is the window size (n), which is tuned manually; a window size of 2 was found to give the best results for Weighted Moving Averages on the previous 2016 values, therefore we get $\begin{align} P_{t} = ( 2*P_{t-1} + P_{t-2} )/3 \end{align}$

Exponential Weighted Moving Averages

https://en.wikipedia.org/wiki/Moving_average#Exponential_moving_average Through weighted averages we have satisfied the idea of giving higher weights to the latest values and decreasing weights to the older ones, but we still do not know which weighting scheme is correct, as there are infinitely many ways to assign weights in a non-increasing order, and we would still have to tune the window-size hyperparameter. To simplify this process we use Exponential Moving Averages, which are a more principled way of assigning weights while implicitly using an optimal window size.

In exponential moving averages we use a single hyperparameter alpha $\begin{align}(\alpha)\end{align}$, which is a value between 0 and 1; based on the value of alpha, both the weights and the effective window size are configured.
For example, if $\begin{align}\alpha=0.9\end{align}$ then the number of previous values on which the current estimate is based is roughly $\begin{align}1/(1-\alpha)=10\end{align}$, i.e. we consider about 10 prior values before predicting the value for the current iteration. The weights are assigned using $\begin{align}2/(N+1)=0.18\end{align}$, where N is the number of prior values being considered, which implies that the latest value is assigned a weight of about 0.18, and the weights keep decreasing exponentially for the older values.
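
A minimal sketch (illustrative numbers only, using the same update rule as the EA_* functions below) of how the single hyperparameter alpha determines both the weighting and the effective window size:

import numpy as np

alpha = 0.9                                       # heavy weight on the running estimate
x = np.array([10, 12, 11, 30, 28, 27, 26, 9])     # toy series of pickup counts

smoothed = float(x[0])
for value in x[1:]:
    # keep alpha of the old estimate, mix in (1 - alpha) of the newest observation
    smoothed = alpha * smoothed + (1 - alpha) * value

print(round(smoothed, 2))    # exponentially weighted estimate after the last value
print(1 / (1 - alpha))       # rough effective window: about 10 previous values matter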

$\begin{align}R^{'}_{t} = \alpha*R_{t-1} + (1-\alpha)*R^{'}_{t-1} \end{align}$

In [67]:
def EA_R1_Predictions(ratios,month):
    predicted_ratio=(ratios['Ratios'].values)[0]
    alpha=0.6
    error=[]
    predicted_values=[]
    predicted_ratio_values=[]
    for i in range(0,4464*40):
        if i%4464==0:
            predicted_ratio_values.append(0)
            predicted_values.append(0)
            error.append(0)
            continue
        predicted_ratio_values.append(predicted_ratio)
        predicted_values.append(int(((ratios['Given'].values)[i])*predicted_ratio))
        error.append(abs((math.pow(int(((ratios['Given'].values)[i])*predicted_ratio)-(ratios['Prediction'].values)[i],1))))
        predicted_ratio = (alpha*predicted_ratio) + (1-alpha)*((ratios['Ratios'].values)[i])
    
    ratios['EA_R1_Predicted'] = predicted_values
    ratios['EA_R1_Error'] = error
    mape_err = (sum(error)/len(error))/(sum(ratios['Prediction'].values)/len(ratios['Prediction'].values))
    mse_err = sum([e**2 for e in error])/len(error)
    return ratios,mape_err,mse_err

$\begin{align}P^{'}_{t} = \alpha*P_{t-1} + (1-\alpha)*P^{'}_{t-1} \end{align}$

In [68]:
def EA_P1_Predictions(ratios,month):
    predicted_value= (ratios['Prediction'].values)[0]
    alpha=0.3
    error=[]
    predicted_values=[]
    for i in range(0,4464*40):
        if i%4464==0:
            predicted_values.append(0)
            error.append(0)
            continue
        predicted_values.append(predicted_value)
        error.append(abs((math.pow(predicted_value-(ratios['Prediction'].values)[i],1))))
        predicted_value =int((alpha*predicted_value) + (1-alpha)*((ratios['Prediction'].values)[i]))
    
    ratios['EA_P1_Predicted'] = predicted_values
    ratios['EA_P1_Error'] = error
    mape_err = (sum(error)/len(error))/(sum(ratios['Prediction'].values)/len(ratios['Prediction'].values))
    mse_err = sum([e**2 for e in error])/len(error)
    return ratios,mape_err,mse_err
In [69]:
mean_err=[0]*10
median_err=[0]*10
ratios_jan,mean_err[0],median_err[0]=MA_R_Predictions(ratios_jan,'jan')
ratios_jan,mean_err[1],median_err[1]=MA_P_Predictions(ratios_jan,'jan')
ratios_jan,mean_err[2],median_err[2]=WA_R_Predictions(ratios_jan,'jan')
ratios_jan,mean_err[3],median_err[3]=WA_P_Predictions(ratios_jan,'jan')
ratios_jan,mean_err[4],median_err[4]=EA_R1_Predictions(ratios_jan,'jan')
ratios_jan,mean_err[5],median_err[5]=EA_P1_Predictions(ratios_jan,'jan')

Comparison between baseline models

We have chosen MAPE (Mean Absolute Percentage Error) as the error metric for comparing models, so that we know how good our predictions are on average. MSE (Mean Squared Error) is also reported so that we have a clearer picture of how the forecasting model behaves on outliers and can make sure there is not too large an error margin between our predictions and the actual values.
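
For reference, a small sketch (toy numbers, hypothetical helper names) of how the two metrics are computed here: the functions above report MAPE as the mean absolute error divided by the mean of the actual values, and MSE as the plain mean of the squared errors.

def mape(actual, predicted):
    # mean absolute error scaled by the mean of the actual values
    errors = [abs(a - p) for a, p in zip(actual, predicted)]
    return (sum(errors) / len(errors)) / (sum(actual) / len(actual))

def mse(actual, predicted):
    # plain mean squared error
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

print(mape([10, 20, 30], [12, 18, 33]))   # ~0.1167
print(mse([10, 20, 30], [12, 18, 33]))    # ~5.67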

In [70]:
print ("Error Metric Matrix (Forecasting Methods) - MAPE & MSE")
print ("--------------------------------------------------------------------------------------------------------")
print ("Moving Averages (Ratios) -                             MAPE: ",mean_err[0],"      MSE: ",median_err[0])
print ("Moving Averages (2016 Values) -                        MAPE: ",mean_err[1],"       MSE: ",median_err[1])
print ("--------------------------------------------------------------------------------------------------------")
print ("Weighted Moving Averages (Ratios) -                    MAPE: ",mean_err[2],"      MSE: ",median_err[2])
print ("Weighted Moving Averages (2016 Values) -               MAPE: ",mean_err[3],"      MSE: ",median_err[3])
print ("--------------------------------------------------------------------------------------------------------")
print ("Exponential Moving Averages (Ratios) -              MAPE: ",mean_err[4],"      MSE: ",median_err[4])
print ("Exponential Moving Averages (2016 Values) -         MAPE: ",mean_err[5],"      MSE: ",median_err[5])
Error Metric Matrix (Forecasting Methods) - MAPE & MSE
--------------------------------------------------------------------------------------------------------
Moving Averages (Ratios) -                             MAPE:  0.1821155173392136       MSE:  400.0625504032258
Moving Averages (2016 Values) -                        MAPE:  0.14292849686975506        MSE:  174.84901993727598
--------------------------------------------------------------------------------------------------------
Weighted Moving Averages (Ratios) -                    MAPE:  0.1784869254376018       MSE:  384.01578741039424
Weighted Moving Averages (2016 Values) -               MAPE:  0.13551088436182082       MSE:  162.46707549283155
--------------------------------------------------------------------------------------------------------
Exponential Moving Averages (Ratios) -              MAPE:  0.17783550194861494       MSE:  378.34610215053766
Exponential Moving Averages (2016 Values) -         MAPE:  0.1350915263669572       MSE:  159.73614471326164

Please note: the above comparisons are made using Jan 2015 and Jan 2016 data only.

From the above matrix it is inferred that the best forecasting model for our prediction would be:- $\begin{align}P^{'}_{t} = \alpha*P_{t-1} + (1-\alpha)*P^{'}_{t-1} \end{align}$ i.e Exponential Moving Averages using 2016 Values

Regression Models

Train-Test Split

Before we start predictions using the tree-based regression models, we take 3 months of 2016 pickup data and split it such that for every region we have 70% of the data in train and 30% in test, ordered date-wise within every region.

In [71]:
# Preparing data to be split into train and test, The below prepares data in cumulative form which will be later split into test and train
# number of 10min indices for jan 2015= 24*31*60/10 = 4464
# number of 10min indices for jan 2016 = 24*31*60/10 = 4464
# number of 10min indices for feb 2016 = 24*29*60/10 = 4176
# number of 10min indices for march 2016 = 24*31*60/10 = 4464
# regions_cum: it will contain 40 lists, each list will contain 4464+4176+4464 values which represent the number of pickups 
# that happened over the three months of 2016

# print(len(regions_cum))
# 40
# print(len(regions_cum[0]))
# 13104

# we take the number of pickups that happened in the last 5 10-min intervals
number_of_time_stamps = 5

# output variable
# it is a list of lists
# it will contain 13099 pickup counts for each cluster
output = []


# tsne_lat will contain the latitude of the cluster center repeated 13104-5=13099 times, for every cluster
# Ex: [[cent_lat 13099times],[cent_lat 13099times], [cent_lat 13099times].... 40 lists]
# it is list of lists
tsne_lat = []


# tsne_lon will contain the longitude of the cluster center repeated 13104-5=13099 times, for every cluster
# Ex: [[cent_long 13099times],[cent_long 13099times], [cent_long 13099times].... 40 lists]
# it is list of lists
tsne_lon = []

# we will code each day 
# sunday = 0, monday=1, tue = 2, wed=3, thur=4, fri=5,sat=6
# for every cluster we will be adding 13099 values, each value represents which day of the week that pickup bin belongs to
# it is list of lists
tsne_weekday = []

# it's a numpy array of shape (523960, 5)
# each row corresponds to an entry in our data
# for the first row we will have [f0,f1,f2,f3,f4], fi = number of pickups that happened in the (i+1)th 10-min interval (bin)
# the second row will have [f1,f2,f3,f4,f5]
# the third row will have [f2,f3,f4,f5,f6]
# and so on...
tsne_feature = []


tsne_feature = [0]*number_of_time_stamps
for i in range(0,40):
    tsne_lat.append([kmeans.cluster_centers_[i][0]]*13099)
    tsne_lon.append([kmeans.cluster_centers_[i][1]]*13099)
    # here Jan 1st 2016 is coded with weekday index 4 in the sunday=0 ... saturday=6 scheme, hence the +4 offset: "(int(k/144))%7+4"
    # our prediction starts from the 5th 10-min index since we need the number of pickups that happened in the previous 5 bins
    tsne_weekday.append([int(((int(k/144))%7+4)%7) for k in range(5,4464+4176+4464)])
    # regions_cum is a list of lists [[x1,x2,x3..x13104], [x1,x2,x3..x13104], [x1,x2,x3..x13104], [x1,x2,x3..x13104], [x1,x2,x3..x13104], .. 40 lists]
    tsne_feature = np.vstack((tsne_feature, [regions_cum[i][r:r+number_of_time_stamps] for r in range(0,len(regions_cum[i])-number_of_time_stamps)]))
    output.append(regions_cum[i][5:])
tsne_feature = tsne_feature[1:]
In [72]:
len(tsne_lat[0])*len(tsne_lat) == tsne_feature.shape[0] == len(tsne_weekday)*len(tsne_weekday[0]) == 40*13099 == len(output)*len(output[0])
Out[72]:
True
In [73]:
# Getting the predictions of exponential moving averages to be used as a feature in cumulative form

# up to now we have computed 8 features for every data point, starting from the 50th minute of the data
# 1. cluster center lattitude
# 2. cluster center longitude
# 3. day of the week 
# 4. f_t_1: number of pickups that happened in the previous (t-1)th 10-min interval
# 5. f_t_2: number of pickups that happened in the previous (t-2)th 10-min interval
# 6. f_t_3: number of pickups that happened in the previous (t-3)th 10-min interval
# 7. f_t_4: number of pickups that happened in the previous (t-4)th 10-min interval
# 8. f_t_5: number of pickups that happened in the previous (t-5)th 10-min interval

# from the baseline models we saw that the exponential weighted moving average gives us the best error
# we will try to add the same exponential weighted moving average at t as a feature to our data
# exponential weighted moving average => p'(t) = alpha*p'(t-1) + (1-alpha)*P(t-1) 
alpha=0.3

# it is a temporary array that stores the exponential weighted moving average for each 10-min interval, 
# it gets reset for each cluster
# for every cluster it contains 13104 values
predicted_values=[]

# it is similar like tsne_lat
# it is list of lists
# predict_list is a list of lists [[x5,x6,x7..x13104], [x5,x6,x7..x13104], [x5,x6,x7..x13104], [x5,x6,x7..x13104], [x5,x6,x7..x13104], .. 40 lists]
predict_list = []
tsne_flat_exp_avg = []
for r in range(0,40):
    for i in range(0,13104):
        if i==0:
            predicted_value= regions_cum[r][0]
            predicted_values.append(0)
            continue
        predicted_values.append(predicted_value)
        predicted_value =int((alpha*predicted_value) + (1-alpha)*(regions_cum[r][i]))
    predict_list.append(predicted_values[5:])
    predicted_values=[]
In [74]:
amplitude_lists = []
frequency_lists = []
for i in range(40):
    ampli  = np.abs(np.fft.fft(regions_cum[i][0:13099]))
    freq = np.abs(np.fft.fftfreq(13099, 1))
    ampli_indices = np.argsort(-ampli)[1:]        #it will return an array of indices for which corresponding amplitude values are sorted in reverse order.
    amplitude_values = []
    frequency_values = []
    for j in range(0, 9, 2):   #taking the top five amplitudes and frequencies (every other index, since for a real signal the FFT amplitudes come in conjugate pairs with equal magnitude)
        amplitude_values.append(ampli[ampli_indices[j]])
        frequency_values.append(freq[ampli_indices[j]])
    for k in range(13099):    #those top 5 frequencies and amplitudes are same for all the points in one cluster
        amplitude_lists.append(amplitude_values)
        frequency_lists.append(frequency_values)
In [75]:
# train, test split : 70% 30% split
# Before we start predictions using the tree based regression models we take 3 months of 2016 pickup data 
# and split it such that for every region we have 70% data in train and 30% in test,
# ordered date-wise for every region
print("size of train data :", int(13099*0.7))
print("size of test data :", int(13099*0.3))
size of train data : 9169
size of test data : 3929
In [76]:
# extracting first 9169 timestamp values i.e 70% of 13099 (total timestamps) for our training data
train_features =  [tsne_feature[i*13099:(13099*i+9169)] for i in range(0,40)]
test_features = [tsne_feature[(i*13099)+9169:(13099*(i+1))] for i in range(0,40)]
In [77]:
print("Number of data clusters",len(train_features), "Number of data points in trian data", len(train_features[0]), "Each data point contains", len(train_features[0][0]),"features")
print("Number of data clusters",len(train_features), "Number of data points in test data", len(test_features[0]), "Each data point contains", len(test_features[0][0]),"features")
Number of data clusters 40 Number of data points in trian data 9169 Each data point contains 5 features
Number of data clusters 40 Number of data points in test data 3930 Each data point contains 5 features
In [78]:
train_fourier_frequencies=[frequency_lists[i*13099:(13099*i+9169)] for i in range(0,40)]
test_fourier_frequencies = [frequency_lists[(i*13099)+9169:(13099*(i+1))] for i in range(0,40)]
In [79]:
train_fourier_amplitudes=[amplitude_lists[i*13099:(13099*i+9169)] for i in range(0,40)]
test_fourier_amplitudes = [amplitude_lists[(i*13099)+9169:(13099*(i+1))] for i in range(0,40)]
In [80]:
# extracting first 9169 timestamp values i.e 70% of 13099 (total timestamps) for our training data
tsne_train_flat_lat = [i[:9169] for i in tsne_lat]
tsne_train_flat_lon = [i[:9169] for i in tsne_lon]
tsne_train_flat_weekday = [i[:9169] for i in tsne_weekday]
tsne_train_flat_output = [i[:9169] for i in output]
tsne_train_flat_exp_avg = [i[:9169] for i in predict_list]
In [81]:
# extracting the rest of the timestamp values i.e 30% of 13099 (total timestamps) for our test data
tsne_test_flat_lat = [i[9169:] for i in tsne_lat]
tsne_test_flat_lon = [i[9169:] for i in tsne_lon]
tsne_test_flat_weekday = [i[9169:] for i in tsne_weekday]
tsne_test_flat_output = [i[9169:] for i in output]
tsne_test_flat_exp_avg = [i[9169:] for i in predict_list]
In [82]:
train_freq=[]
test_freq=[]
train_amp=[]
test_amp=[]
for i in range(40):
    train_freq.extend(train_fourier_frequencies[i])
    test_freq.extend(test_fourier_frequencies[i])
    train_amp.extend(train_fourier_amplitudes[i])
    test_amp.extend(test_fourier_amplitudes[i])
In [83]:
# the above contains values in the form of lists of lists (i.e. a list of values for each region); here we flatten them into one list
train_new_features = []
for i in range(0,40):
    train_new_features.extend(train_features[i])
test_new_features = []
for i in range(0,40):
    test_new_features.extend(test_features[i])
In [84]:
len(test_new_features)
Out[84]:
157200
In [85]:
train_new_freq_amp=np.hstack((train_freq,train_amp,train_new_features))
test_new_freq_amp=np.hstack((test_freq,test_amp,test_new_features))    
In [150]:
# converting lists of lists into a single list i.e. flatten
# a  = [[1,2,3,4],[4,6,7,8]]
# print(sum(a,[]))
# [1, 2, 3, 4, 4, 6, 7, 8]

tsne_train_lat = sum(tsne_train_flat_lat, [])
tsne_train_lon = sum(tsne_train_flat_lon, [])
tsne_train_weekday = sum(tsne_train_flat_weekday, [])
tsne_train_output = sum(tsne_train_flat_output, [])
tsne_train_exp_avg = sum(tsne_train_flat_exp_avg,[])
In [151]:
# converting lists of lists into a single list i.e. flatten
# a  = [[1,2,3,4],[4,6,7,8]]
# print(sum(a,[]))
# [1, 2, 3, 4, 4, 6, 7, 8]

tsne_test_lat = sum(tsne_test_flat_lat, [])
tsne_test_lon = sum(tsne_test_flat_lon, [])
tsne_test_weekday = sum(tsne_test_flat_weekday, [])
tsne_test_output = sum(tsne_test_flat_output, [])
tsne_test_exp_avg = sum(tsne_test_flat_exp_avg,[])
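
As a side note, the same flattening can also be done with itertools.chain, which avoids the repeated list concatenation that sum(list_of_lists, []) performs; a sketch (the *_alt names are only illustrative):

from itertools import chain

# equivalent to sum(tsne_train_flat_lat, []) but linear in the total number of elements
tsne_train_lat_alt = list(chain.from_iterable(tsne_train_flat_lat))
tsne_test_lat_alt  = list(chain.from_iterable(tsne_test_flat_lat))
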
In [152]:
# Preparing the data frame for our train data
columns = ['freq1','freq2','freq3','freq4','freq5','amp1','amp2','amp3','amp4','amp5','ft_5','ft_4','ft_3','ft_2','ft_1']
df_train = pd.DataFrame(data=train_new_freq_amp, columns=columns) 
df_train['lat'] = tsne_train_lat
df_train['lon'] = tsne_train_lon
df_train['weekday'] = tsne_train_weekday
df_train['exp_avg'] = tsne_train_exp_avg
print(df_train.shape)
(366760, 19)
In [153]:
# Preparing the data frame for our test data
df_test = pd.DataFrame(data=test_new_freq_amp, columns=columns) 
df_test['lat'] = tsne_test_lat
df_test['lon'] = tsne_test_lon
df_test['weekday'] = tsne_test_weekday
df_test['exp_avg'] = tsne_test_exp_avg
print(df_test.shape)
(157200, 19)
In [154]:
df_train.head()
Out[154]:
freq1 freq2 freq3 freq4 freq5 amp1 amp2 amp3 amp4 amp5 ft_5 ft_4 ft_3 ft_2 ft_1 lat lon weekday exp_avg
0 0.006947 0.013894 0.012902 0.034735 0.00794 365809.640884 186621.632173 82362.984238 63773.192055 62261.203509 0.0 63.0 217.0 189.0 137.0 40.776228 -73.982119 4 150
1 0.006947 0.013894 0.012902 0.034735 0.00794 365809.640884 186621.632173 82362.984238 63773.192055 62261.203509 63.0 217.0 189.0 137.0 135.0 40.776228 -73.982119 4 139
2 0.006947 0.013894 0.012902 0.034735 0.00794 365809.640884 186621.632173 82362.984238 63773.192055 62261.203509 217.0 189.0 137.0 135.0 129.0 40.776228 -73.982119 4 132
3 0.006947 0.013894 0.012902 0.034735 0.00794 365809.640884 186621.632173 82362.984238 63773.192055 62261.203509 189.0 137.0 135.0 129.0 150.0 40.776228 -73.982119 4 144
4 0.006947 0.013894 0.012902 0.034735 0.00794 365809.640884 186621.632173 82362.984238 63773.192055 62261.203509 137.0 135.0 129.0 150.0 164.0 40.776228 -73.982119 4 158
In [155]:
tsne_train_output = pd.DataFrame(tsne_train_output,columns=['no of pickups'])
tsne_test_output = pd.DataFrame(tsne_test_output,columns=['no of pickups'])
In [ ]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data.db')
df_train.to_sql('df_train', engine, if_exists='replace')
print(df_train.shape)
tsne_train_output.to_sql('tsne_train_output', engine, if_exists='replace')
df_test.to_sql('df_test', engine, if_exists='replace')
print(df_test.shape)
tsne_test_output.to_sql('tsne_test_output', engine, if_exists='replace')
In [15]:
def f(x):
    #Clean even numbers from columns. Don't use exp for the amplitudes (the values are too large)
    return [np.log10(x),np.log(x),np.sqrt(x),np.cbrt(x),np.sin(x),np.cos(x),np.tan(x)]
df_train[['freq1_log10','freq1_log','freq1_sqrt','freq1_cbrt','freq1_sin','freq1_cos','freq1_tan']] = df_train.freq1.apply(f).apply(pd.Series)
df_train[['freq2_log10','freq2_log','freq2_sqrt','freq2_cbrt','freq2_sin','freq2_cos','freq2_tan']] = df_train.freq2.apply(f).apply(pd.Series)
df_train[['freq3_log10','freq3_log','freq3_sqrt','freq3_cbrt','freq3_sin','freq3_cos','freq3_tan']] = df_train.freq3.apply(f).apply(pd.Series)
df_train[['freq4_log10','freq4_log','freq4_sqrt','freq4_cbrt','freq4_sin','freq4_cos','freq4_tan']] = df_train.freq4.apply(f).apply(pd.Series)
df_train[['freq5_log10','freq5_log','freq5_sqrt','freq5_cbrt','freq5_sin','freq5_cos','freq5_tan']] = df_train.freq5.apply(f).apply(pd.Series)

df_train[['amp1_log10','amp1_log','amp1_sqrt','amp1_cbrt','amp1_sin','amp1_cos','amp1_tan']] = df_train.amp1.apply(f).apply(pd.Series)
df_train[['amp2_log10','amp2_log','amp2_sqrt','amp2_cbrt','amp2_sin','amp2_cos','amp2_tan']] = df_train.amp2.apply(f).apply(pd.Series)
df_train[['amp3_log10','amp3_log','amp3_sqrt','amp3_cbrt','amp3_sin','amp3_cos','amp3_tan']] = df_train.amp3.apply(f).apply(pd.Series)
df_train[['amp4_log10','amp4_log','amp4_sqrt','amp4_cbrt','amp4_sin','amp4_cos','amp4_tan']] = df_train.amp4.apply(f).apply(pd.Series)
df_train[['amp5_log10','amp5_log','amp5_sqrt','amp5_cbrt','amp5_sin','amp5_cos','amp5_tan']] = df_train.amp5.apply(f).apply(pd.Series)

df_test[['freq1_log10','freq1_log','freq1_sqrt','freq1_cbrt','freq1_sin','freq1_cos','freq1_tan']] = df_test.freq1.apply(f).apply(pd.Series)
df_test[['freq2_log10','freq2_log','freq2_sqrt','freq2_cbrt','freq2_sin','freq2_cos','freq2_tan']] = df_test.freq2.apply(f).apply(pd.Series)
df_test[['freq3_log10','freq3_log','freq3_sqrt','freq3_cbrt','freq3_sin','freq3_cos','freq3_tan']] = df_test.freq3.apply(f).apply(pd.Series)
df_test[['freq4_log10','freq4_log','freq4_sqrt','freq4_cbrt','freq4_sin','freq4_cos','freq4_tan']] = df_test.freq4.apply(f).apply(pd.Series)
df_test[['freq5_log10','freq5_log','freq5_sqrt','freq5_cbrt','freq5_sin','freq5_cos','freq5_tan']] = df_test.freq5.apply(f).apply(pd.Series)

df_test[['amp1_log10','amp1_log','amp1_sqrt','amp1_cbrt','amp1_sin','amp1_cos','amp1_tan']] = df_test.amp1.apply(f).apply(pd.Series)
df_test[['amp2_log10','amp2_log','amp2_sqrt','amp2_cbrt','amp2_sin','amp2_cos','amp2_tan']] = df_test.amp2.apply(f).apply(pd.Series)
df_test[['amp3_log10','amp3_log','amp3_sqrt','amp3_cbrt','amp3_sin','amp3_cos','amp3_tan']] = df_test.amp3.apply(f).apply(pd.Series)
df_test[['amp4_log10','amp4_log','amp4_sqrt','amp4_cbrt','amp4_sin','amp4_cos','amp4_tan']] = df_test.amp4.apply(f).apply(pd.Series)
df_test[['amp5_log10','amp5_log','amp5_sqrt','amp5_cbrt','amp5_sin','amp5_cos','amp5_tan']] = df_test.amp5.apply(f).apply(pd.Series)
In [2]:
#http://www.sqlitetutorial.net/sqlite-python/create-tables/
from sqlalchemy import create_engine
import sqlite3
def create_connection(db_file):
    """ create a database connection to the SQLite database
        specified by db_file
    :param db_file: database file
    :return: Connection object or None
    """
    try:
        conn = sqlite3.connect(db_file)
        return conn
    except sqlite3.Error as e:
        print(e)
 
    return None


def checkTableExists(dbcon):
    cursr = dbcon.cursor()
    str = "select name from sqlite_master where type='table'"
    table_names = cursr.execute(str)
    print("Tables in the databse:")
    tables =table_names.fetchall() 
    print(tables[0][0])
    return(len(tables))#http://www.sqlitetutorial.net/sqlite-python/create-tables/
In [3]:
read_db = 'data.db'
conn_r = create_connection(read_db)
checkTableExists(conn_r)
conn_r.close()
Tables in the database:
df_train
In [4]:
# try to sample data according to the computing power you have
if os.path.isfile(read_db):
    conn_r = create_connection(read_db)
    if conn_r is not None:
        # for selecting first 1M rows
        # data = pd.read_sql_query("""SELECT * FROM data LIMIT 100001;""", conn_r)
        
        # for selecting random points
        df_train = pd.read_sql_query("SELECT * From df_train ;", conn_r,index_col='index')
        df_test = pd.read_sql_query("SELECT * From df_test ;", conn_r,index_col='index')
        tsne_train_output = pd.read_sql_query("SELECT * From tsne_train_output ;", conn_r,index_col='index')
        tsne_test_output = pd.read_sql_query("SELECT * From tsne_test_output ;", conn_r,index_col='index')
        conn_r.commit()
        conn_r.close()
In [5]:
tsne_train_output=tsne_train_output['no of pickups'].tolist()
tsne_test_output=tsne_test_output['no of pickups'].tolist()
In [6]:
ab=pd.Series(tsne_train_output)
df_train['no of pickups']=ab.values
cd=pd.Series(tsne_test_output)
df_test['no of pickups']=cd.values
In [8]:
df_train=df_train[:36676]
df_test=df_test[:15720]
In [64]:
print(df_train.shape)
print(df_test.shape)
(366760, 20)
(157200, 20)

Using Linear Regression

In [158]:
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import GridSearchCV
lr_reg = SGDRegressor(loss = "squared_loss", penalty = "l2")
values = [10**-14, 10**-12, 10**-10, 10**-8, 10**-6, 10**-4, 10**-2, 10**0, 10**2, 10**4, 10**6]
parameters = {'alpha': values}
grid = GridSearchCV(lr_reg,parameters, cv=5)
grid.fit(df_train, tsne_train_output)
best_score = grid.best_score_
best_params = grid.best_params_
print(best_score)
print(best_params)
-6.125738985950115e+34
{'alpha': 1e-12}
In [159]:
# find more about LinearRegression function here http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html
# -------------------------
# default paramters
# sklearn.linear_model.LinearRegression(fit_intercept=True, normalize=False, copy_X=True, n_jobs=1)

# some of methods of LinearRegression()
# fit(X, y[, sample_weight])	Fit linear model.
# get_params([deep])	Get parameters for this estimator.
# predict(X)	Predict using the linear model
# score(X, y[, sample_weight])	Returns the coefficient of determination R^2 of the prediction.
# set_params(**params)	Set the parameters of this estimator.
# -----------------------
# video link: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/geometric-intuition-1-2-copy-8/
# -----------------------


lr_reg=SGDRegressor(loss = "squared_loss", penalty = "l2",alpha=1e-12)
lr_reg.fit(df_train, tsne_train_output)

y_pred = lr_reg.predict(df_test)
lr_test_predictions = [round(value) for value in y_pred]
y_pred = lr_reg.predict(df_train)
lr_train_predictions = [round(value) for value in y_pred]
print((mean_absolute_error(tsne_train_output, lr_train_predictions))/(sum(tsne_train_output)/len(tsne_train_output)))
8.811453915909701e+16

Using Random Forest Regressor

In [160]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
param_dist = {"n_estimators":sp_randint(10,200),
              "max_depth": sp_randint(10,15),
              "min_samples_split": sp_randint(110,190),
              "min_samples_leaf": sp_randint(25,65)}
regr1 = RandomForestRegressor(max_features='sqrt',n_jobs=-1)
rf_random = RandomizedSearchCV(regr1, param_distributions=param_dist,
                                   n_iter=5,cv=10,scoring='neg_mean_absolute_error',random_state=25)
rf_random.fit(df_train, tsne_train_output)
print(rf_random.best_estimator_)
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=14,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=28, min_samples_split=111,
           min_weight_fraction_leaf=0.0, n_estimators=160, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
In [161]:
# Training a hyper-parameter tuned random forest regressor on our train data
# find more about the RandomForestRegressor function here http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html
# -------------------------
# default paramters
# sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion=’mse’, max_depth=None, min_samples_split=2, 
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, 
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)

# some of methods of RandomForestRegressor()
# apply(X)	Apply trees in the forest to X, return leaf indices.
# decision_path(X)	Return the decision path in the forest
# fit(X, y[, sample_weight])	Build a forest of trees from the training set (X, y).
# get_params([deep])	Get parameters for this estimator.
# predict(X)	Predict regression target for X.
# score(X, y[, sample_weight])	Returns the coefficient of determination R^2 of the prediction.
# -----------------------
# video link1: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/regression-using-decision-trees-2/
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------

regr1 = RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=14,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=28, min_samples_split=111,
           min_weight_fraction_leaf=0.0, n_estimators=160, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
regr1.fit(df_train, tsne_train_output)
Out[161]:
RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=14,
           max_features='sqrt', max_leaf_nodes=None,
           min_impurity_decrease=0.0, min_impurity_split=None,
           min_samples_leaf=28, min_samples_split=111,
           min_weight_fraction_leaf=0.0, n_estimators=160, n_jobs=-1,
           oob_score=False, random_state=None, verbose=0, warm_start=False)
In [162]:
# Predicting on test data using our trained random forest model 

# the model regr1 is already hyper-parameter tuned
# the parameters that we got above were found using randomized search

y_pred = regr1.predict(df_test)
rndf_test_predictions = [round(value) for value in y_pred]
y_pred = regr1.predict(df_train)
rndf_train_predictions = [round(value) for value in y_pred]
print((mean_absolute_error(tsne_train_output,rndf_train_predictions))/(sum(tsne_train_output)/len(tsne_train_output)))
0.12597363533359804
In [163]:
#feature importances based on analysis using random forest
print (df_train.columns)
print (regr1.feature_importances_)
Index(['freq1', 'freq2', 'freq3', 'freq4', 'freq5', 'amp1', 'amp2', 'amp3',
       'amp4', 'amp5', 'ft_5', 'ft_4', 'ft_3', 'ft_2', 'ft_1', 'lat', 'lon',
       'weekday', 'exp_avg'],
      dtype='object')
[1.56003861e-04 3.85926275e-04 2.75025116e-04 2.22785427e-04
 2.07521846e-04 4.40425540e-03 4.73031111e-03 5.68278389e-03
 5.05183407e-03 1.01272128e-02 8.54283762e-02 9.64844963e-02
 1.51327585e-01 1.63475992e-01 2.46547800e-01 7.92476608e-04
 6.36425514e-04 1.59636494e-04 2.23903552e-01]
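
To read the importance vector above more easily, a small sketch (assuming regr1 and df_train from the cells above) that pairs each column name with its importance and sorts them in descending order:

# pair column names with importances and sort, highest first
feat_imp = sorted(zip(df_train.columns, regr1.feature_importances_),
                  key=lambda t: t[1], reverse=True)
for name, imp in feat_imp:
    print("{:<10s} {:.4f}".format(name, imp))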

Using XgBoost Regressor

In [17]:
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint as sp_randint
param_dist = {"n_estimators":sp_randint(10,300),
              "max_depth": sp_randint(1,6)}
xgb_model = xgb.XGBRegressor()
rf_random = RandomizedSearchCV(xgb_model, param_distributions=param_dist,
                                   cv=3,scoring='neg_mean_absolute_error')
rf_random.fit(df_train, tsne_train_output)
print(rf_random.best_estimator_)
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=207,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
In [18]:
# Training a hyper-parameter tuned Xg-Boost regressor on our train data

# find more about XGBRegressor function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#module-xgboost.sklearn
# -------------------------
# default paramters
# xgboost.XGBRegressor(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, objective='reg:linear', 
# booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, max_delta_step=0, subsample=1, colsample_bytree=1, 
# colsample_bylevel=1, reg_alpha=0, reg_lambda=1, scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, 
# missing=None, **kwargs)

# some of the methods of XGBRegressor()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep])	Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance
# -----------------------
# video link1: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/regression-using-decision-trees-2/
# video link2: https://www.appliedaicourse.com/course/applied-ai-course-online/lessons/what-are-ensembles/
# -----------------------

x_model = xgb.XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=207,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
x_model.fit(df_train, tsne_train_output)
Out[18]:
XGBRegressor(base_score=0.5, booster='gbtree', colsample_bylevel=1,
       colsample_bytree=1, gamma=0, learning_rate=0.1, max_delta_step=0,
       max_depth=3, min_child_weight=1, missing=None, n_estimators=207,
       n_jobs=1, nthread=None, objective='reg:linear', random_state=0,
       reg_alpha=0, reg_lambda=1, scale_pos_weight=1, seed=None,
       silent=True, subsample=1)
In [19]:
#predicting with our trained Xg-Boost regressor
# the model x_model is already hyper-parameter tuned
# the parameters that we got above were found using randomized search

y_pred = x_model.predict(df_test)
xgb_test_predictions = [round(value) for value in y_pred]
y_pred = x_model.predict(df_train)
xgb_train_predictions = [round(value) for value in y_pred]
print((mean_absolute_error(tsne_train_output, xgb_train_predictions))/(sum(tsne_train_output)/len(tsne_train_output)))
0.13033036720227303

Calculating the error metric values for various models

In [172]:
train_mape=[]
test_mape=[]

train_mape.append((mean_absolute_error(tsne_train_output,df_train['ft_1'].values))/(sum(tsne_train_output)/len(tsne_train_output)))
train_mape.append((mean_absolute_error(tsne_train_output,df_train['exp_avg'].values))/(sum(tsne_train_output)/len(tsne_train_output)))
train_mape.append((mean_absolute_error(tsne_train_output,rndf_train_predictions))/(sum(tsne_train_output)/len(tsne_train_output)))
train_mape.append((mean_absolute_error(tsne_train_output, xgb_train_predictions))/(sum(tsne_train_output)/len(tsne_train_output)))
train_mape.append((mean_absolute_error(tsne_train_output, lr_train_predictions))/(sum(tsne_train_output)/len(tsne_train_output)))

test_mape.append((mean_absolute_error(tsne_test_output, df_test['ft_1'].values))/(sum(tsne_test_output)/len(tsne_test_output)))
test_mape.append((mean_absolute_error(tsne_test_output, df_test['exp_avg'].values))/(sum(tsne_test_output)/len(tsne_test_output)))
test_mape.append((mean_absolute_error(tsne_test_output, rndf_test_predictions))/(sum(tsne_test_output)/len(tsne_test_output)))
test_mape.append((mean_absolute_error(tsne_test_output, xgb_test_predictions))/(sum(tsne_test_output)/len(tsne_test_output)))
test_mape.append((mean_absolute_error(tsne_test_output, lr_test_predictions))/(sum(tsne_test_output)/len(tsne_test_output)))
In [173]:
print ("Error Metric Matrix (Tree Based Regression Methods) -  MAPE")
print ("--------------------------------------------------------------------------------------------------------")
print ("Baseline Model -                             Train: ",train_mape[0],"      Test: ",test_mape[0])
print ("Exponential Averages Forecasting -           Train: ",train_mape[1],"      Test: ",test_mape[1])
print ("Linear Regression -                         Train: ",train_mape[3],"      Test: ",test_mape[3])
print ("Random Forest Regression -                   Train: ",train_mape[2],"     Test: ",test_mape[2])
Error Metric Matrix (Tree Based Regression Methods) -  MAPE
--------------------------------------------------------------------------------------------------------
Baseline Model -                             Train:  0.14005275878666593       Test:  0.13653125704827038
Exponential Averages Forecasting -           Train:  0.13289968436017227       Test:  0.12936180420430524
XgBoost Regression -                        Train:  0.12834821987999948       Test:  0.12631456215666562
Random Forest Regression -                   Train:  0.12597363533359804      Test:  0.12590409000378458

Error Metric Matrix

In [174]:
print ("Error Metric Matrix (Tree Based Regression Methods) -  MAPE")
print ("--------------------------------------------------------------------------------------------------------")
print ("Baseline Model -                             Train: ",train_mape[0],"      Test: ",test_mape[0])
print ("Exponential Averages Forecasting -           Train: ",train_mape[1],"      Test: ",test_mape[1])
print ("Linear Regression -                         Train: ",train_mape[4],"      Test: ",test_mape[4])
print ("Random Forest Regression -                   Train: ",train_mape[2],"     Test: ",test_mape[2])
print ("XgBoost Regression -                         Train: ",train_mape[3],"      Test: ",test_mape[3])
print ("--------------------------------------------------------------------------------------------------------")
Error Metric Matrix (Tree Based Regression Methods) -  MAPE
--------------------------------------------------------------------------------------------------------
Baseline Model -                             Train:  0.14005275878666593       Test:  0.13653125704827038
Exponential Averages Forecasting -           Train:  0.13289968436017227       Test:  0.12936180420430524
Linear Regression -                         Train:  8.811453915909701e+16       Test:  8.136965240914746e+16
Random Forest Regression -                   Train:  0.12597363533359804      Test:  0.12590409000378458
XgBoost Regression -                         Train:  0.12834821987999948       Test:  0.12631456215666562
--------------------------------------------------------------------------------------------------------
In [7]:
#https://machinelearningmastery.com/time-series-prediction-lstm-recurrent-neural-networks-python-keras/
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import LSTM
from sklearn.preprocessing import MinMaxScaler
from sklearn.metrics import mean_squared_error
import numpy as np
Using TensorFlow backend.
In [8]:
np.random.seed(7)
In [9]:
scaler = MinMaxScaler(feature_range=(0, 1))
df_train = scaler.fit_transform(df_train)
df_test=scaler.transform(df_test)
In [10]:
# convert an array of values into a dataset matrix
def create_dataset(dataset, look_back=1):
    dataX, dataY = [], []
    #print(len(dataset)-look_back-1)
    for i in range(len(dataset)-look_back-1):
        #print(dataset[i, 19])
        a = dataset[i, 19]
        dataX.append(a)
        #print(dataset[i+look_back, 19])
        dataY.append(dataset[i + look_back, 19])
        #print("--"*20)
    return np.array(dataX), np.array(dataY)
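
A quick illustration (toy array, not the real data) of what create_dataset returns: with look_back=1 it pairs the value in column 19 of each row with the value one row later.

toy = np.zeros((5, 20))
toy[:, 19] = np.arange(5)              # column 19 holds [0, 1, 2, 3, 4]
X, y = create_dataset(toy, look_back=1)
print(X)   # [0. 1. 2.]
print(y)   # [1. 2. 3.]
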
In [11]:
df_training=df_train[:293408]
print(df_train.shape)
cross_val=df_train[293408:]
print(cross_val.shape)
(366760, 20)
(73352, 20)
In [14]:
look_back = 1
trainX, trainY = create_dataset(df_training, look_back)
valX,valY= create_dataset(cross_val,look_back)
testX, testY = create_dataset(df_test, look_back)
In [15]:
trainX = np.reshape(trainX, (trainX.shape[0], 1,look_back))
valX   = np.reshape(valX,(valX.shape[0],1,look_back))
testX  = np.reshape(testX, (testX.shape[0], 1,look_back))
In [16]:
import keras
model = Sequential()
model.add(LSTM(4, input_shape=(1, look_back)))
model.add(Dense(1))
keras.backend.set_epsilon(1)
model.compile(loss='mse', optimizer='adam',metrics=['mae','mape'])
model.fit(trainX, trainY, epochs=200, batch_size=100, verbose=2,validation_data=(valX,valY))
WARNING:tensorflow:From C:\Users\mchetankumar\AppData\Local\Continuum\anaconda3\envs\TaxiEnv\lib\site-packages\tensorflow\python\framework\op_def_library.py:263: colocate_with (from tensorflow.python.framework.ops) is deprecated and will be removed in a future version.
Instructions for updating:
Colocations handled automatically by placer.
WARNING:tensorflow:From C:\Users\mchetankumar\AppData\Local\Continuum\anaconda3\envs\TaxiEnv\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Train on 293406 samples, validate on 73350 samples
Epoch 1/200
 - 13s - loss: 0.0224 - mean_absolute_error: 0.1126 - mean_absolute_percentage_error: 11.2598 - val_loss: 0.0191 - val_mean_absolute_error: 0.1094 - val_mean_absolute_percentage_error: 10.9384
Epoch 2/200
 - 10s - loss: 0.0188 - mean_absolute_error: 0.1113 - mean_absolute_percentage_error: 11.1257 - val_loss: 0.0185 - val_mean_absolute_error: 0.1079 - val_mean_absolute_percentage_error: 10.7928
Epoch 3/200
 - 11s - loss: 0.0183 - mean_absolute_error: 0.1097 - mean_absolute_percentage_error: 10.9731 - val_loss: 0.0180 - val_mean_absolute_error: 0.1064 - val_mean_absolute_percentage_error: 10.6405
Epoch 4/200
 - 11s - loss: 0.0178 - mean_absolute_error: 0.1082 - mean_absolute_percentage_error: 10.8174 - val_loss: 0.0175 - val_mean_absolute_error: 0.1049 - val_mean_absolute_percentage_error: 10.4854
Epoch 5/200
 - 10s - loss: 0.0172 - mean_absolute_error: 0.1066 - mean_absolute_percentage_error: 10.6588 - val_loss: 0.0170 - val_mean_absolute_error: 0.1033 - val_mean_absolute_percentage_error: 10.3318
Epoch 6/200
 - 10s - loss: 0.0167 - mean_absolute_error: 0.1050 - mean_absolute_percentage_error: 10.4988 - val_loss: 0.0165 - val_mean_absolute_error: 0.1018 - val_mean_absolute_percentage_error: 10.1776
Epoch 7/200
 - 10s - loss: 0.0162 - mean_absolute_error: 0.1034 - mean_absolute_percentage_error: 10.3398 - val_loss: 0.0160 - val_mean_absolute_error: 0.1002 - val_mean_absolute_percentage_error: 10.0174
Epoch 8/200
 - 10s - loss: 0.0157 - mean_absolute_error: 0.1017 - mean_absolute_percentage_error: 10.1737 - val_loss: 0.0155 - val_mean_absolute_error: 0.0986 - val_mean_absolute_percentage_error: 9.8576
Epoch 9/200
 - 11s - loss: 0.0152 - mean_absolute_error: 0.1001 - mean_absolute_percentage_error: 10.0083 - val_loss: 0.0150 - val_mean_absolute_error: 0.0969 - val_mean_absolute_percentage_error: 9.6916
Epoch 10/200
 - 11s - loss: 0.0147 - mean_absolute_error: 0.0983 - mean_absolute_percentage_error: 9.8337 - val_loss: 0.0145 - val_mean_absolute_error: 0.0953 - val_mean_absolute_percentage_error: 9.5322
Epoch 11/200
 - 10s - loss: 0.0142 - mean_absolute_error: 0.0967 - mean_absolute_percentage_error: 9.6654 - val_loss: 0.0140 - val_mean_absolute_error: 0.0936 - val_mean_absolute_percentage_error: 9.3588
Epoch 12/200
 - 11s - loss: 0.0137 - mean_absolute_error: 0.0949 - mean_absolute_percentage_error: 9.4880 - val_loss: 0.0135 - val_mean_absolute_error: 0.0918 - val_mean_absolute_percentage_error: 9.1843
Epoch 13/200
 - 11s - loss: 0.0132 - mean_absolute_error: 0.0931 - mean_absolute_percentage_error: 9.3059 - val_loss: 0.0130 - val_mean_absolute_error: 0.0901 - val_mean_absolute_percentage_error: 9.0067
Epoch 14/200
 - 11s - loss: 0.0127 - mean_absolute_error: 0.0912 - mean_absolute_percentage_error: 9.1212 - val_loss: 0.0125 - val_mean_absolute_error: 0.0882 - val_mean_absolute_percentage_error: 8.8249
Epoch 15/200
 - 11s - loss: 0.0122 - mean_absolute_error: 0.0893 - mean_absolute_percentage_error: 8.9328 - val_loss: 0.0120 - val_mean_absolute_error: 0.0864 - val_mean_absolute_percentage_error: 8.6364
Epoch 16/200
 - 10s - loss: 0.0117 - mean_absolute_error: 0.0874 - mean_absolute_percentage_error: 8.7374 - val_loss: 0.0115 - val_mean_absolute_error: 0.0845 - val_mean_absolute_percentage_error: 8.4470
Epoch 17/200
 - 11s - loss: 0.0112 - mean_absolute_error: 0.0854 - mean_absolute_percentage_error: 8.5388 - val_loss: 0.0110 - val_mean_absolute_error: 0.0825 - val_mean_absolute_percentage_error: 8.2534
Epoch 18/200
 - 10s - loss: 0.0107 - mean_absolute_error: 0.0834 - mean_absolute_percentage_error: 8.3380 - val_loss: 0.0105 - val_mean_absolute_error: 0.0805 - val_mean_absolute_percentage_error: 8.0539
Epoch 19/200
 - 10s - loss: 0.0101 - mean_absolute_error: 0.0813 - mean_absolute_percentage_error: 8.1311 - val_loss: 0.0100 - val_mean_absolute_error: 0.0785 - val_mean_absolute_percentage_error: 7.8501
Epoch 20/200
 - 10s - loss: 0.0096 - mean_absolute_error: 0.0792 - mean_absolute_percentage_error: 7.9177 - val_loss: 0.0095 - val_mean_absolute_error: 0.0765 - val_mean_absolute_percentage_error: 7.6473
Epoch 21/200
 - 11s - loss: 0.0091 - mean_absolute_error: 0.0771 - mean_absolute_percentage_error: 7.7058 - val_loss: 0.0090 - val_mean_absolute_error: 0.0743 - val_mean_absolute_percentage_error: 7.4346
Epoch 22/200
 - 10s - loss: 0.0086 - mean_absolute_error: 0.0749 - mean_absolute_percentage_error: 7.4871 - val_loss: 0.0085 - val_mean_absolute_error: 0.0722 - val_mean_absolute_percentage_error: 7.2194
Epoch 23/200
 - 11s - loss: 0.0082 - mean_absolute_error: 0.0726 - mean_absolute_percentage_error: 7.2648 - val_loss: 0.0081 - val_mean_absolute_error: 0.0700 - val_mean_absolute_percentage_error: 7.0027
Epoch 24/200
 - 10s - loss: 0.0077 - mean_absolute_error: 0.0704 - mean_absolute_percentage_error: 7.0395 - val_loss: 0.0076 - val_mean_absolute_error: 0.0678 - val_mean_absolute_percentage_error: 6.7842
Epoch 25/200
 - 11s - loss: 0.0072 - mean_absolute_error: 0.0681 - mean_absolute_percentage_error: 6.8125 - val_loss: 0.0071 - val_mean_absolute_error: 0.0657 - val_mean_absolute_percentage_error: 6.5652
Epoch 26/200
 - 10s - loss: 0.0068 - mean_absolute_error: 0.0658 - mean_absolute_percentage_error: 6.5834 - val_loss: 0.0067 - val_mean_absolute_error: 0.0635 - val_mean_absolute_percentage_error: 6.3482
Epoch 27/200
 - 11s - loss: 0.0063 - mean_absolute_error: 0.0636 - mean_absolute_percentage_error: 6.3565 - val_loss: 0.0063 - val_mean_absolute_error: 0.0613 - val_mean_absolute_percentage_error: 6.1257
Epoch 28/200
 - 10s - loss: 0.0059 - mean_absolute_error: 0.0613 - mean_absolute_percentage_error: 6.1271 - val_loss: 0.0059 - val_mean_absolute_error: 0.0590 - val_mean_absolute_percentage_error: 5.9047
Epoch 29/200
 - 11s - loss: 0.0055 - mean_absolute_error: 0.0590 - mean_absolute_percentage_error: 5.8976 - val_loss: 0.0055 - val_mean_absolute_error: 0.0569 - val_mean_absolute_percentage_error: 5.6864
Epoch 30/200
 - 11s - loss: 0.0051 - mean_absolute_error: 0.0567 - mean_absolute_percentage_error: 5.6701 - val_loss: 0.0051 - val_mean_absolute_error: 0.0547 - val_mean_absolute_percentage_error: 5.4711
Epoch 31/200
 - 10s - loss: 0.0047 - mean_absolute_error: 0.0544 - mean_absolute_percentage_error: 5.4448 - val_loss: 0.0047 - val_mean_absolute_error: 0.0526 - val_mean_absolute_percentage_error: 5.2613
Epoch 32/200
 - 11s - loss: 0.0044 - mean_absolute_error: 0.0523 - mean_absolute_percentage_error: 5.2259 - val_loss: 0.0044 - val_mean_absolute_error: 0.0505 - val_mean_absolute_percentage_error: 5.0463
Epoch 33/200
 - 11s - loss: 0.0040 - mean_absolute_error: 0.0501 - mean_absolute_percentage_error: 5.0057 - val_loss: 0.0041 - val_mean_absolute_error: 0.0485 - val_mean_absolute_percentage_error: 4.8464
Epoch 34/200
 - 10s - loss: 0.0037 - mean_absolute_error: 0.0480 - mean_absolute_percentage_error: 4.7968 - val_loss: 0.0038 - val_mean_absolute_error: 0.0464 - val_mean_absolute_percentage_error: 4.6410
Epoch 35/200
 - 11s - loss: 0.0034 - mean_absolute_error: 0.0459 - mean_absolute_percentage_error: 4.5864 - val_loss: 0.0035 - val_mean_absolute_error: 0.0445 - val_mean_absolute_percentage_error: 4.4542
Epoch 36/200
 - 11s - loss: 0.0031 - mean_absolute_error: 0.0439 - mean_absolute_percentage_error: 4.3879 - val_loss: 0.0032 - val_mean_absolute_error: 0.0427 - val_mean_absolute_percentage_error: 4.2734
Epoch 37/200
 - 11s - loss: 0.0029 - mean_absolute_error: 0.0420 - mean_absolute_percentage_error: 4.1963 - val_loss: 0.0030 - val_mean_absolute_error: 0.0410 - val_mean_absolute_percentage_error: 4.0984
Epoch 38/200
 - 10s - loss: 0.0027 - mean_absolute_error: 0.0401 - mean_absolute_percentage_error: 4.0124 - val_loss: 0.0028 - val_mean_absolute_error: 0.0393 - val_mean_absolute_percentage_error: 3.9303
Epoch 39/200
 - 11s - loss: 0.0025 - mean_absolute_error: 0.0384 - mean_absolute_percentage_error: 3.8381 - val_loss: 0.0026 - val_mean_absolute_error: 0.0377 - val_mean_absolute_percentage_error: 3.7696
Epoch 40/200
 - 11s - loss: 0.0023 - mean_absolute_error: 0.0367 - mean_absolute_percentage_error: 3.6710 - val_loss: 0.0024 - val_mean_absolute_error: 0.0362 - val_mean_absolute_percentage_error: 3.6218
Epoch 41/200
 - 10s - loss: 0.0021 - mean_absolute_error: 0.0351 - mean_absolute_percentage_error: 3.5148 - val_loss: 0.0023 - val_mean_absolute_error: 0.0348 - val_mean_absolute_percentage_error: 3.4844
Epoch 42/200
 - 10s - loss: 0.0019 - mean_absolute_error: 0.0337 - mean_absolute_percentage_error: 3.3678 - val_loss: 0.0021 - val_mean_absolute_error: 0.0336 - val_mean_absolute_percentage_error: 3.3592
Epoch 43/200
 - 10s - loss: 0.0018 - mean_absolute_error: 0.0323 - mean_absolute_percentage_error: 3.2315 - val_loss: 0.0020 - val_mean_absolute_error: 0.0324 - val_mean_absolute_percentage_error: 3.2408
Epoch 44/200
 - 11s - loss: 0.0017 - mean_absolute_error: 0.0310 - mean_absolute_percentage_error: 3.1046 - val_loss: 0.0019 - val_mean_absolute_error: 0.0313 - val_mean_absolute_percentage_error: 3.1338
Epoch 45/200
 - 11s - loss: 0.0016 - mean_absolute_error: 0.0299 - mean_absolute_percentage_error: 2.9879 - val_loss: 0.0018 - val_mean_absolute_error: 0.0303 - val_mean_absolute_percentage_error: 3.0344
Epoch 46/200
 - 10s - loss: 0.0015 - mean_absolute_error: 0.0288 - mean_absolute_percentage_error: 2.8801 - val_loss: 0.0017 - val_mean_absolute_error: 0.0295 - val_mean_absolute_percentage_error: 2.9464
Epoch 47/200
 - 11s - loss: 0.0014 - mean_absolute_error: 0.0278 - mean_absolute_percentage_error: 2.7813 - val_loss: 0.0016 - val_mean_absolute_error: 0.0287 - val_mean_absolute_percentage_error: 2.8666
Epoch 48/200
 - 10s - loss: 0.0013 - mean_absolute_error: 0.0269 - mean_absolute_percentage_error: 2.6927 - val_loss: 0.0016 - val_mean_absolute_error: 0.0279 - val_mean_absolute_percentage_error: 2.7892
Epoch 49/200
 - 10s - loss: 0.0013 - mean_absolute_error: 0.0261 - mean_absolute_percentage_error: 2.6098 - val_loss: 0.0015 - val_mean_absolute_error: 0.0273 - val_mean_absolute_percentage_error: 2.7258
Epoch 50/200
 - 11s - loss: 0.0012 - mean_absolute_error: 0.0254 - mean_absolute_percentage_error: 2.5364 - val_loss: 0.0015 - val_mean_absolute_error: 0.0267 - val_mean_absolute_percentage_error: 2.6678
Epoch 51/200
 - 11s - loss: 0.0012 - mean_absolute_error: 0.0247 - mean_absolute_percentage_error: 2.4706 - val_loss: 0.0014 - val_mean_absolute_error: 0.0261 - val_mean_absolute_percentage_error: 2.6125
Epoch 52/200
 - 11s - loss: 0.0011 - mean_absolute_error: 0.0241 - mean_absolute_percentage_error: 2.4105 - val_loss: 0.0014 - val_mean_absolute_error: 0.0257 - val_mean_absolute_percentage_error: 2.5668
Epoch 53/200
 - 11s - loss: 0.0011 - mean_absolute_error: 0.0236 - mean_absolute_percentage_error: 2.3568 - val_loss: 0.0014 - val_mean_absolute_error: 0.0253 - val_mean_absolute_percentage_error: 2.5253
Epoch 54/200
 - 11s - loss: 0.0011 - mean_absolute_error: 0.0231 - mean_absolute_percentage_error: 2.3091 - val_loss: 0.0013 - val_mean_absolute_error: 0.0249 - val_mean_absolute_percentage_error: 2.4879
Epoch 55/200
 - 11s - loss: 0.0010 - mean_absolute_error: 0.0227 - mean_absolute_percentage_error: 2.2665 - val_loss: 0.0013 - val_mean_absolute_error: 0.0246 - val_mean_absolute_percentage_error: 2.4553
Epoch 56/200
 - 11s - loss: 0.0010 - mean_absolute_error: 0.0223 - mean_absolute_percentage_error: 2.2289 - val_loss: 0.0013 - val_mean_absolute_error: 0.0243 - val_mean_absolute_percentage_error: 2.4263
Epoch 57/200
 - 10s - loss: 9.9462e-04 - mean_absolute_error: 0.0220 - mean_absolute_percentage_error: 2.1953 - val_loss: 0.0013 - val_mean_absolute_error: 0.0240 - val_mean_absolute_percentage_error: 2.4009
Epoch 58/200
 - 11s - loss: 9.7884e-04 - mean_absolute_error: 0.0217 - mean_absolute_percentage_error: 2.1654 - val_loss: 0.0013 - val_mean_absolute_error: 0.0238 - val_mean_absolute_percentage_error: 2.3790
Epoch 59/200
 - 11s - loss: 9.6547e-04 - mean_absolute_error: 0.0214 - mean_absolute_percentage_error: 2.1399 - val_loss: 0.0013 - val_mean_absolute_error: 0.0236 - val_mean_absolute_percentage_error: 2.3561
Epoch 60/200
 - 11s - loss: 9.5416e-04 - mean_absolute_error: 0.0212 - mean_absolute_percentage_error: 2.1161 - val_loss: 0.0013 - val_mean_absolute_error: 0.0234 - val_mean_absolute_percentage_error: 2.3410
Epoch 61/200
 - 11s - loss: 9.4462e-04 - mean_absolute_error: 0.0210 - mean_absolute_percentage_error: 2.0958 - val_loss: 0.0012 - val_mean_absolute_error: 0.0233 - val_mean_absolute_percentage_error: 2.3269
Epoch 62/200
 - 11s - loss: 9.3658e-04 - mean_absolute_error: 0.0208 - mean_absolute_percentage_error: 2.0784 - val_loss: 0.0012 - val_mean_absolute_error: 0.0231 - val_mean_absolute_percentage_error: 2.3113
Epoch 63/200
 - 11s - loss: 9.2984e-04 - mean_absolute_error: 0.0206 - mean_absolute_percentage_error: 2.0615 - val_loss: 0.0012 - val_mean_absolute_error: 0.0230 - val_mean_absolute_percentage_error: 2.3046
Epoch 64/200
 - 11s - loss: 9.2416e-04 - mean_absolute_error: 0.0205 - mean_absolute_percentage_error: 2.0487 - val_loss: 0.0012 - val_mean_absolute_error: 0.0229 - val_mean_absolute_percentage_error: 2.2941
Epoch 65/200
 - 11s - loss: 9.1938e-04 - mean_absolute_error: 0.0204 - mean_absolute_percentage_error: 2.0365 - val_loss: 0.0012 - val_mean_absolute_error: 0.0228 - val_mean_absolute_percentage_error: 2.2837
Epoch 66/200
 - 10s - loss: 9.1539e-04 - mean_absolute_error: 0.0203 - mean_absolute_percentage_error: 2.0255 - val_loss: 0.0012 - val_mean_absolute_error: 0.0228 - val_mean_absolute_percentage_error: 2.2775
Epoch 67/200
 - 11s - loss: 9.1203e-04 - mean_absolute_error: 0.0202 - mean_absolute_percentage_error: 2.0161 - val_loss: 0.0012 - val_mean_absolute_error: 0.0227 - val_mean_absolute_percentage_error: 2.2730
Epoch 68/200
 - 11s - loss: 9.0921e-04 - mean_absolute_error: 0.0201 - mean_absolute_percentage_error: 2.0092 - val_loss: 0.0012 - val_mean_absolute_error: 0.0226 - val_mean_absolute_percentage_error: 2.2630
Epoch 69/200
 - 10s - loss: 9.0688e-04 - mean_absolute_error: 0.0200 - mean_absolute_percentage_error: 2.0011 - val_loss: 0.0012 - val_mean_absolute_error: 0.0226 - val_mean_absolute_percentage_error: 2.2599
Epoch 70/200
 - 10s - loss: 9.0491e-04 - mean_absolute_error: 0.0200 - mean_absolute_percentage_error: 1.9951 - val_loss: 0.0012 - val_mean_absolute_error: 0.0226 - val_mean_absolute_percentage_error: 2.2561
Epoch 71/200
 - 10s - loss: 9.0328e-04 - mean_absolute_error: 0.0199 - mean_absolute_percentage_error: 1.9894 - val_loss: 0.0012 - val_mean_absolute_error: 0.0225 - val_mean_absolute_percentage_error: 2.2523
Epoch 72/200
 - 11s - loss: 9.0189e-04 - mean_absolute_error: 0.0198 - mean_absolute_percentage_error: 1.9844 - val_loss: 0.0012 - val_mean_absolute_error: 0.0225 - val_mean_absolute_percentage_error: 2.2513
Epoch 73/200
 - 11s - loss: 9.0076e-04 - mean_absolute_error: 0.0198 - mean_absolute_percentage_error: 1.9806 - val_loss: 0.0012 - val_mean_absolute_error: 0.0225 - val_mean_absolute_percentage_error: 2.2465
Epoch 74/200
 - 12s - loss: 8.9980e-04 - mean_absolute_error: 0.0198 - mean_absolute_percentage_error: 1.9764 - val_loss: 0.0012 - val_mean_absolute_error: 0.0224 - val_mean_absolute_percentage_error: 2.2447
Epoch 75/200
 - 11s - loss: 8.9900e-04 - mean_absolute_error: 0.0197 - mean_absolute_percentage_error: 1.9731 - val_loss: 0.0012 - val_mean_absolute_error: 0.0224 - val_mean_absolute_percentage_error: 2.2417
Epoch 76/200
 - 11s - loss: 8.9832e-04 - mean_absolute_error: 0.0197 - mean_absolute_percentage_error: 1.9700 - val_loss: 0.0012 - val_mean_absolute_error: 0.0224 - val_mean_absolute_percentage_error: 2.2400
Epoch 77/200
 - 11s - loss: 8.9776e-04 - mean_absolute_error: 0.0197 - mean_absolute_percentage_error: 1.9675 - val_loss: 0.0012 - val_mean_absolute_error: 0.0224 - val_mean_absolute_percentage_error: 2.2376
Epoch 78/200
 - 10s - loss: 8.9728e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9650 - val_loss: 0.0012 - val_mean_absolute_error: 0.0224 - val_mean_absolute_percentage_error: 2.2385
Epoch 79/200
 - 11s - loss: 8.9689e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9637 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2335
Epoch 80/200
 - 11s - loss: 8.9656e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9612 - val_loss: 0.0012 - val_mean_absolute_error: 0.0224 - val_mean_absolute_percentage_error: 2.2350
Epoch 81/200
 - 11s - loss: 8.9628e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9604 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2334
Epoch 82/200
 - 11s - loss: 8.9604e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9588 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2321
Epoch 83/200
 - 11s - loss: 8.9584e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9575 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2328
Epoch 84/200
 - 10s - loss: 8.9567e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9572 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2292
Epoch 85/200
 - 11s - loss: 8.9552e-04 - mean_absolute_error: 0.0196 - mean_absolute_percentage_error: 1.9554 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2316
Epoch 86/200
 - 11s - loss: 8.9540e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9548 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2316
Epoch 87/200
 - 12s - loss: 8.9530e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9544 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2303
Epoch 88/200
 - 12s - loss: 8.9520e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9537 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2297
Epoch 89/200
 - 11s - loss: 8.9512e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9528 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2313
Epoch 90/200
 - 11s - loss: 8.9505e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9529 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2286
Epoch 91/200
 - 11s - loss: 8.9499e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9522 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2273
Epoch 92/200
 - 11s - loss: 8.9493e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9518 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2258
Epoch 93/200
 - 11s - loss: 8.9489e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9508 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2288
Epoch 94/200
 - 11s - loss: 8.9485e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9509 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2284
Epoch 95/200
 - 11s - loss: 8.9481e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9506 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2291
Epoch 96/200
 - 11s - loss: 8.9477e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9503 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2288
Epoch 97/200
 - 11s - loss: 8.9473e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9503 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2278
Epoch 98/200
 - 11s - loss: 8.9470e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9503 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2253
Epoch 99/200
 - 11s - loss: 8.9468e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9499 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2261
Epoch 100/200
 - 11s - loss: 8.9465e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9491 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2287
Epoch 101/200
 - 11s - loss: 8.9463e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9497 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2279
Epoch 102/200
 - 11s - loss: 8.9460e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9492 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2288
Epoch 103/200
 - 11s - loss: 8.9457e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9496 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2266
Epoch 104/200
 - 10s - loss: 8.9455e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9490 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2281
Epoch 105/200
 - 11s - loss: 8.9453e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9493 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2261
Epoch 106/200
 - 11s - loss: 8.9451e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9485 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2293
Epoch 107/200
 - 11s - loss: 8.9449e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9494 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2264
Epoch 108/200
 - 11s - loss: 8.9447e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9488 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2276
Epoch 109/200
 - 11s - loss: 8.9445e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9488 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2283
Epoch 110/200
 - 11s - loss: 8.9444e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9490 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2283
Epoch 111/200
 - 11s - loss: 8.9442e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9489 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2281
Epoch 112/200
 - 11s - loss: 8.9440e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9493 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2261
Epoch 113/200
 - 11s - loss: 8.9438e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9484 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2282
Epoch 114/200
 - 11s - loss: 8.9436e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9490 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2286
Epoch 115/200
 - 12s - loss: 8.9434e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9490 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2277
Epoch 116/200
 - 12s - loss: 8.9432e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9489 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2277
Epoch 117/200
 - 12s - loss: 8.9431e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9486 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2288
Epoch 118/200
 - 11s - loss: 8.9430e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9491 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2286
Epoch 119/200
 - 11s - loss: 8.9427e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9492 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2260
Epoch 120/200
 - 11s - loss: 8.9426e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9486 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2278
Epoch 121/200
 - 11s - loss: 8.9425e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9490 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2280
Epoch 122/200
 - 11s - loss: 8.9423e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9491 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2269
Epoch 123/200
 - 10s - loss: 8.9422e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9489 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2277
Epoch 124/200
 - 10s - loss: 8.9420e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9492 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2279
Epoch 125/200
 - 11s - loss: 8.9418e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9494 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2270
Epoch 126/200
 - 10s - loss: 8.9417e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9487 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2295
Epoch 127/200
 - 11s - loss: 8.9416e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9493 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2291
Epoch 128/200
 - 11s - loss: 8.9414e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9495 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2279
Epoch 129/200
 - 11s - loss: 8.9413e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9494 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2281
Epoch 130/200
 - 10s - loss: 8.9411e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9491 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2294
Epoch 131/200
 - 11s - loss: 8.9410e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9499 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2265
Epoch 132/200
 - 10s - loss: 8.9409e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9492 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2281
Epoch 133/200
 - 11s - loss: 8.9407e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9495 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2289
Epoch 134/200
 - 10s - loss: 8.9406e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9498 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2283
Epoch 135/200
 - 11s - loss: 8.9403e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9491 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2311
Epoch 136/200
 - 10s - loss: 8.9403e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9502 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2271
Epoch 137/200
 - 11s - loss: 8.9402e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9494 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2281
Epoch 138/200
 - 11s - loss: 8.9398e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9495 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2289
Epoch 139/200
 - 12s - loss: 8.9398e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9496 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2306
Epoch 140/200
 - 11s - loss: 8.9398e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9502 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2275
Epoch 141/200
 - 10s - loss: 8.9397e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9494 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2306
Epoch 142/200
 - 11s - loss: 8.9396e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9501 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2298
Epoch 143/200
 - 11s - loss: 8.9394e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9500 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2298
Epoch 144/200
 - 10s - loss: 8.9393e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9502 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2284
Epoch 145/200
 - 11s - loss: 8.9392e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9499 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2292
Epoch 146/200
 - 10s - loss: 8.9391e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9501 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2281
Epoch 147/200
 - 10s - loss: 8.9390e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9498 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2294
Epoch 148/200
 - 11s - loss: 8.9387e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9497 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2317
Epoch 149/200
 - 11s - loss: 8.9388e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9507 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2290
Epoch 150/200
 - 10s - loss: 8.9386e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9502 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2286
Epoch 151/200
 - 11s - loss: 8.9385e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9503 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2280
Epoch 152/200
 - 11s - loss: 8.9384e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9503 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2277
Epoch 153/200
 - 11s - loss: 8.9383e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9500 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2288
Epoch 154/200
 - 11s - loss: 8.9381e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9501 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2305
Epoch 155/200
 - 11s - loss: 8.9381e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9505 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2299
Epoch 156/200
 - 11s - loss: 8.9379e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9507 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2285
Epoch 157/200
 - 11s - loss: 8.9378e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9507 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2270
Epoch 158/200
 - 10s - loss: 8.9377e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9499 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2311
Epoch 159/200
 - 11s - loss: 8.9377e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9508 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2308
Epoch 160/200
 - 11s - loss: 8.9375e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9510 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2282
Epoch 161/200
 - 11s - loss: 8.9375e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9505 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2298
Epoch 162/200
 - 11s - loss: 8.9373e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9509 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2286
Epoch 163/200
 - 11s - loss: 8.9372e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9507 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2287
Epoch 164/200
 - 10s - loss: 8.9371e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9506 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2311
Epoch 165/200
 - 11s - loss: 8.9371e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9510 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2297
Epoch 166/200
 - 11s - loss: 8.9370e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9509 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2303
Epoch 167/200
 - 11s - loss: 8.9368e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9511 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2287
Epoch 168/200
 - 11s - loss: 8.9368e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9509 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2293
Epoch 169/200
 - 11s - loss: 8.9367e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9510 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2301
Epoch 170/200
 - 11s - loss: 8.9366e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9510 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2302
Epoch 171/200
 - 11s - loss: 8.9364e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9515 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2281
Epoch 172/200
 - 12s - loss: 8.9364e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9510 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2290
Epoch 173/200
 - 11s - loss: 8.9363e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9512 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2293
Epoch 174/200
 - 11s - loss: 8.9363e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9511 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2313
Epoch 175/200
 - 11s - loss: 8.9362e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9515 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2303
Epoch 176/200
 - 10s - loss: 8.9361e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9512 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2305
Epoch 177/200
 - 11s - loss: 8.9360e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9513 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2312
Epoch 178/200
 - 11s - loss: 8.9358e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9512 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2322
Epoch 179/200
 - 10s - loss: 8.9359e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9517 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2297
Epoch 180/200
 - 11s - loss: 8.9358e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9513 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2300
Epoch 181/200
 - 11s - loss: 8.9356e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9512 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2321
Epoch 182/200
 - 11s - loss: 8.9355e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9519 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2293
Epoch 183/200
 - 11s - loss: 8.9355e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9511 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2311
Epoch 184/200
 - 11s - loss: 8.9354e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9518 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2308
Epoch 185/200
 - 11s - loss: 8.9353e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9513 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2323
Epoch 186/200
 - 11s - loss: 8.9353e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9516 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2313
Epoch 187/200
 - 11s - loss: 8.9351e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9520 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2296
Epoch 188/200
 - 10s - loss: 8.9351e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9515 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2317
Epoch 189/200
 - 11s - loss: 8.9350e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9522 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2294
Epoch 190/200
 - 11s - loss: 8.9350e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9514 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2311
Epoch 191/200
 - 10s - loss: 8.9348e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9517 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2321
Epoch 192/200
 - 10s - loss: 8.9348e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9521 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2303
Epoch 193/200
 - 10s - loss: 8.9347e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9519 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2296
Epoch 194/200
 - 11s - loss: 8.9347e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9514 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2330
Epoch 195/200
 - 11s - loss: 8.9346e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9523 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2318
Epoch 196/200
 - 13s - loss: 8.9345e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9521 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2316
Epoch 197/200
 - 10s - loss: 8.9345e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9523 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2310
Epoch 198/200
 - 11s - loss: 8.9344e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9522 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2306
Epoch 199/200
 - 11s - loss: 8.9342e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9519 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2322
Epoch 200/200
 - 11s - loss: 8.9342e-04 - mean_absolute_error: 0.0195 - mean_absolute_percentage_error: 1.9524 - val_loss: 0.0012 - val_mean_absolute_error: 0.0223 - val_mean_absolute_percentage_error: 2.2303
Out[16]:
<keras.callbacks.History at 0x24ccc707320>
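Judging from the log, the loss curve flattens out around epoch 90 (training loss stuck near 8.93e-04 and validation MAPE near 2.23 for the last hundred epochs), so most of the remaining epochs add little. A minimal sketch, assuming the same model, trainX/trainY and validation data used in the fit call above, of cutting training short with Keras's EarlyStopping callback:

In [ ]:
from keras.callbacks import EarlyStopping

# stop once val_loss has not improved for 10 consecutive epochs
early_stop = EarlyStopping(monitor='val_loss', patience=10, verbose=1)

# model.fit(trainX, trainY, epochs=200, validation_data=(testX, testY),
#           callbacks=[early_stop], verbose=2)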
In [17]:
trainPredict = model.predict(trainX)
testPredict = model.predict(testX)
In [18]:
# MAE normalized by the mean demand: a single MAPE-like error ratio for train and test
print((mean_absolute_error(trainY, trainPredict))/(sum(trainY)/len(trainY)))
print((mean_absolute_error(testY, testPredict))/(sum(testY)/len(testY)))
0.11922007855268595
0.119943823828164
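The two printed ratios are the mean absolute error divided by the mean demand, roughly 11.9% on both train and test. If a per-sample MAPE (the metric referred to in the assignment below) is wanted instead, a minimal sketch, assuming the same trainY/trainPredict and testY/testPredict arrays from the previous cells:

In [ ]:
import numpy as np

def mape(y_true, y_pred):
    # per-sample mean absolute percentage error; zero-demand bins are skipped
    # to avoid division by zero
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    mask = y_true != 0
    return np.mean(np.abs(y_true[mask] - y_pred[mask]) / y_true[mask])

# print(mape(trainY, trainPredict), mape(testY, testPredict))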

Assignments

In [0]:
'''
Task 1: Incorporate Fourier features into Regression models and measure MAPE.

Task 2: Perform hyper-parameter tuning for Regression models.
        2a. Linear Regression: Grid Search
        2b. Random Forest: Random Search
        2c. Xgboost: Random Search
Task 3: Explore more time-series features using Google search/Quora/Stackoverflow
        to reduce the MAPE to < 12%
'''
Out[0]:
'\nTask 1: Incorporate Fourier features into Regression models and measure MAPE.\n\nTask 2: Perform hyper-parameter tuning for Regression models.\n        2a. Linear Regression: Grid Search\n        2b. Random Forest: Random Search\n        2c. Xgboost: Random Search\nTask 3: Explore more time-series features using Google search/Quora/Stackoverflow\n        to reduce the MAPE to < 12%\n'
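For Task 1, one common construction of Fourier features for a demand series is to take sines and cosines of the time-bin index at the daily period (and optionally the weekly period). A minimal sketch, assuming a hypothetical period argument (for example 144 if the series is binned into 10-minute intervals) and an existing feature matrix to append the columns to:

In [ ]:
import numpy as np

def fourier_features(n_samples, period, orders=(1, 2, 3)):
    # sin/cos terms of one seasonal cycle; higher orders capture sharper daily peaks
    t = np.arange(n_samples)
    cols = []
    for k in orders:
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

# Hypothetical usage: X_train is the regression feature matrix built earlier
# X_train = np.hstack([X_train, fourier_features(len(X_train), period=144)])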
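For Tasks 2b and 2c, scikit-learn's RandomizedSearchCV can drive the random search over RandomForestRegressor (and over xgboost via its sklearn-style XGBRegressor wrapper). A minimal sketch, assuming hypothetical feature/target arrays X_train, y_train from the regression section; the parameter ranges are illustrative, not tuned values:

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# candidate hyper-parameter ranges sampled at random
param_dist = {
    'n_estimators': [50, 100, 200, 400],
    'max_depth': [3, 5, 8, None],
    'min_samples_leaf': [1, 5, 10, 20],
}
search = RandomizedSearchCV(RandomForestRegressor(), param_dist, n_iter=10,
                            scoring='neg_mean_absolute_error', cv=3, verbose=1)
# search.fit(X_train, y_train)
# print(search.best_params_, -search.best_score_)

Because the targets form a time series, passing sklearn's TimeSeriesSplit as cv keeps each validation fold later in time than its training folds, which is usually a better fit here than plain k-fold splitting.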